SUMMARY


This  project  developed  transfer  learning  methods  that  enable  teams  of  heterogeneous  agents  to 
rapidly  adapt  control  and  coordination  policies  to  new  scenarios.  Our  approach  uses  a 
combination  of  lifelong  transfer  learning  and  automated  instruction  to  support  continual 
transfer  among  heterogeneous  agents  and  across  diverse  tasks.  The  resulting  system  accumulates 
transferrable  knowledge  over  consecutive  tasks,  enabling  the  transfer  learning  process  to 
improve  over  time  and  the  system  to  become  increasingly  versatile.  We  apply  these  methods  to 
sequential  decision-making  (SDM)  tasks  in  dynamic  environments  with  both  simple  benchmark 
tasks  and  more  complex  aerial  and  ground  robot  tasks. 

Our  work  has  produced  the  following  novel  contributions: 

•  Lifelong  accumulation  of  transferrable  knowledge  for  SDM  tasks:  We  developed  the 
first  general-purpose  framework  for  lifelong  reinforcement  learning  (RL).  This  general 
lifelong  RL  framework  supports  a  wide  variety  of  base  RL  learning  algorithms,  provides 
theoretical  guarantees  on  performance  and  convergence,  and  has  a  computational 
complexity  independent  of  the  number  of  tasks  learned,  ensuring  scalability.  The  system 
also  supports  the  reverse  transfer  of  new  knowledge  to  previously  learned  models, 
enabling  continual  improvement  among  all  agents. 

•  Autonomous  cross-domain  transfer  of  knowledge:  We  developed  several  approaches 
for  enabling  autonomous  cross-domain  transfer  between  diverse  task  domains,  including 
1)  manifold-based  methods  for  transfer  learning,  and  2)  autonomous  cross-domain 
lifelong  learning.  By  projecting  diverse  tasks  into  a  common  space,  our  approach 
supports  transfer  between  domains  and  models  with  differing  feature  and  action  spaces, 
enhancing the system’s ability to transfer between diverse SDM tasks. We demonstrated
the effectiveness of this approach by showing, for the first time, cross-domain lifelong
learning between control policies for different dynamical systems (e.g., from quadrotors to
helicopters to bicycles).

• Robust control of aerial and ground robots in the presence of disturbances: We
applied our methods to the problem of learning controllers for robots with novel
disturbances in their sensors and actuators. Results show that new robots with novel
errors can more rapidly compensate for these imperfections by leveraging lifelong
transfer learning.

•  Zero-shot  lifelong  transfer  learning  from  high-level  task  descriptions:  We  showed 
that  our  lifelong  learning  methods  can  also  incorporate  high-level  descriptions  of  the 
tasks  to  improve  the  performance  of  the  lifelong  learning  process.  Most  importantly, 
given  a  high-level  description  for  a  novel  task  (e.g.,  the  description  of  a  new  dynamical 
system),  this  approach  was  able  to  synthesize  a  high-performance  controller  for  that  new 
task  without  observing  any  training  data  on  that  new  task. 

• Improved understanding of automated instruction: Our work has leveraged crowdsourcing
as a novel way of collecting data for learning agents. We have shown that naive users can
provide accurate advice to an agent learning SDM tasks in multiple settings.

This  final  report  summarizes  our  most  relevant  findings  over  the  course  of  this  research  grant 
and  briefly  highlights  remaining  open  problems. 

INTRODUCTION


Humans  learn  to  solve  increasingly  complex  tasks  by  continually  building  upon  and  refining 
their  knowledge.  Virtually  every  aspect  of  higher-level  learning  and  memory  involves  this 
process  of  knowledge  transfer,  and  it  is  often  credited  with  enabling  all  learning  beyond  simple 
stimulus-response  cycles.  Recent  research  on  transfer  in  machine  learning  seeks  to  duplicate  this 
notion  of  reusing  knowledge  from  a  set  of  previously  learned  source  tasks  to  improve  learning 
on  a  new  target  task.  Each  task  represents  a  single  learning  problem,  such  as  determining  an 
action  to  take  in  a  given  situation  or  recognizing  a  particular  target  in  imagery.  Results  show  that 
transfer  from  previously  learned  models  can  improve  the  performance  of  new  models  in  a  wide 
variety  of  problem  domains,  including  handwriting  recognition,  social  network  analysis,  image 
classification,  maze  navigation,  simulated  robot  soccer,  and  robot  control. 

However,  transfer  learning  research  has  yet  to  enable  the  rapid  and  continual  lifelong  learning  of 
complex tasks that occurs in humans. Instead, current research focuses on isolated transfer
to a single target task (known generally as transfer learning) or the simultaneous
learning of multiple tasks (known as multi-task learning), showing little consideration for the
intricacies of learning multiple tasks presented in sequence across multiple domains, as occurs in
human  learning.  Such  situations  occur  constantly  in  operational  scenarios  where  the  focus  shifts 
between  tasks  due  to  changing  mission  requirements  and/or  situational  awareness.  Autonomous 
systems  have  the  additional  advantage  that  multiple  agents  can  efficiently  share  knowledge, 
enabling  transfer  learning  to  provide  collective  improvement  to  the  system. 

Our  research  has  developed  transfer  learning  methods  that  enable  teams  of  heterogeneous  agents 
to  rapidly  adapt  control  and  coordination  policies  to  new  scenarios.  Our  approach  uses  a 
combination  of  lifelong  transfer  learning  and  automated  instruction  to  support  continual  transfer 
among  heterogeneous  agents  and  across  diverse  tasks,  as  described  in  the  scenario  above.  These 
developed  methods  have  been  applied  to  complex  sequential  decision-making  (SDM)  tasks  using 
aerial  and  ground  robots. 


METHODS,  ASSUMPTIONS,  AND  PROCEDURES 

Our  approach  was  centered  on  two  major  thrusts:  1)  lifelong  transfer  learning  for  sequential 
decision  making,  and  2)  automated  instruction.  These  methods  were  developed  to  support 
continual  transfer  among  heterogeneous  agents  and  across  diverse  tasks,  as  demonstrated  in  their 
application  to  multi-agent  search  and  retrieval  using  aerial  and  ground  robots.  This  section 
summarizes  our  key  developments  in  each  of  these  areas,  with  specific  results  presented  in  the 
next  section. 


Continual  Transfer  in  Lifelong  Reinforcement  Learning 

We developed the first general-purpose lifelong RL framework, which is capable of learning
multiple consecutive RL tasks and autonomously sharing knowledge between the tasks (Figure
1). Our approach is based upon the Efficient Lifelong Learning Algorithm (ELLA)
framework [Ruvolo and Eaton, 2013a, 2013b] (see Appendix A for details), which was originally
developed for classification and regression. Our lifelong RL framework can support various base
RL algorithms (TD-learning, Q-learning, policy gradients, etc.), providing a mechanism to adapt
existing RL methods to the lifelong learning setting. Most notably, we developed PG-ELLA for
lifelong policy gradient learning [Bou Ammar et al., 2014] and GTD-ELLA for lifelong
gradient TD-learning [Sreenivasan et al., 2014].


[Figure 1 diagram labels: time axis; previously learned tasks; new task; future learning
tasks; Lifelong Learning System. Step 1: tasks are received consecutively.]

Figure 1: General framework and process for lifelong reinforcement learning, in this case shown
with multiple task domains.


Both of these approaches assume a factored representation of the learned knowledge to facilitate
transfer (Figure 2). These approaches employ parameterized RL policies, representing the policy
for each task as a parameter vector. To facilitate transfer between tasks, our approach learns a
shared sparse basis over the parameterized policies, where each basis component effectively
represents a chunk of knowledge that captures regularities within the space of policies. The
sparse basis is then shared between all tasks, facilitating transfer between the policies as each
of them is reconstructed in the shared basis. Given sampled trajectories for a new task, we
estimate the policy for that new task using a few steps of RL, and then reconstruct the estimated
policy in the shared basis, transferring knowledge through the reconstruction. We show in the
next section that this approach is highly effective for lifelong learning of control policies for
a variety of dynamical systems and applications to ground vehicle and quadrotor control. Lifelong
RL provides a significant jumpstart in initial performance and strong performance gains over
learning the new task in isolation through single task learning.

Figure 2: Factored representation of learned model.
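
As a rough illustration of the factored representation, the sketch below reconstructs one task's policy parameter vector in a shared basis by solving a small L1-regularized least-squares problem with iterative soft-thresholding (ISTA). The names `L`, `s`, `theta_est`, and `mu`, the random basis values, and the use of a plain squared-error objective are assumptions made for illustration; they are not taken from the PG-ELLA implementation.

```python
import numpy as np

def soft_threshold(x, tau):
    """Elementwise soft-thresholding, the proximal operator of tau * ||x||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def sparse_code(theta, L, mu=0.1, n_iters=2000):
    """Approximately solve min_s 0.5 * ||theta - L s||^2 + mu * ||s||_1 with ISTA."""
    s = np.zeros(L.shape[1])
    # Step size 1 / (largest singular value of L)^2 keeps the iteration stable.
    step = 1.0 / (np.linalg.norm(L, 2) ** 2)
    for _ in range(n_iters):
        grad = L.T @ (L @ s - theta)
        s = soft_threshold(s - step * grad, step * mu)
    return s

rng = np.random.default_rng(0)
d, k = 8, 4                              # policy dimension, number of basis components
L = rng.standard_normal((d, k))          # shared basis (hypothetical values)
s_true = np.array([1.5, 0.0, -2.0, 0.0])
theta_est = L @ s_true                   # stand-in for a policy estimated by a few RL steps
s_hat = sparse_code(theta_est, L, mu=0.01)
theta_rec = L @ s_hat                    # policy reconstructed in the shared basis
```

Here `theta_rec` plays the role of the transferred policy: it is expressed entirely through basis components that other tasks also use, which is where the sharing of knowledge would come from in this simplified picture.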


We  also  developed  a  fully  online  version  of  PG-ELLA  that  has  reduced  computational 
complexity  and  exhibits  sublinear  regret,  thus  providing  strong  theoretical  guarantees  [Bou 
Ammar  et  al.,  2015b]  (see  Appendix  C  for  details).  The  fully  online  variant  can  also  incorporate 
safety  constraints  into  the  policy  search  process,  allowing  it  to  produce  safe  policies.  This 
capability  is  especially  important  for  practical  deployment  of  lifelong  transfer  learning, 
effectively  guaranteeing  that  the  policies  it  produces  via  transfer  will  obey  given  safety 
constraints. The form of these safety constraints is particularly versatile, enabling constraints on
(for  example)  allowable  states,  motion  trajectories,  and  smoothness  of  transitions. 
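
One generic way to keep policy parameters inside a safe set is to project them back after each update. The sketch below shows this with a projected gradient step and two simple constraint sets (a box and an L2 ball). It is a minimal illustration under assumed constraint shapes and hypothetical function names, not the constrained optimization procedure of the cited paper.

```python
import numpy as np

def project_to_box(theta, lower, upper):
    """Euclidean projection onto the box lower <= theta <= upper (elementwise)."""
    return np.clip(theta, lower, upper)

def project_to_ball(theta, radius):
    """Euclidean projection onto the L2 ball of the given radius."""
    norm = np.linalg.norm(theta)
    return theta if norm <= radius else theta * (radius / norm)

def safe_gradient_step(theta, grad, step, lower, upper):
    """One gradient-ascent step followed by projection back into the safe box."""
    return project_to_box(theta + step * grad, lower, upper)

theta = np.array([0.9, -0.2])
grad = np.array([1.0, -3.0])             # hypothetical policy-gradient estimate
theta_next = safe_gradient_step(theta, grad, 0.5, lower=-1.0, upper=1.0)
```

Because the projection is applied after every update, the parameters returned by `safe_gradient_step` never leave the box, whatever the gradient estimate is.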


While all of these methods are highly effective, they are limited to transfer between homogeneous
agents (e.g., from one quadrotor controller to another quadrotor controller). To remedy this
problem, we developed a mechanism to enable autonomous cross-domain transfer in lifelong
learning systems [Bou Ammar et al., 2015c] (see Appendix D for details). To enable autonomous
cross-domain transfer, we added another layer of factorization to the lifelong RL framework
(Figure 3). The system learns a single shared basis that underlies all task domains, and then
learns a set of projection matrices that specialize that shared basis into a domain-specific basis
for each task domain. Next, the domain-specific basis is used to factorize the learned policies for
tasks from that domain. This mechanism allows, for the first time, effective transfer between
diverse tasks (e.g., between controllers for cart-poles, bicycles, quadrotors, and helicopters). The
paper on this approach was nominated for a best paper award at IJCAI’15. In addition, to
understand the potential mechanisms underlying autonomous cross-domain transfer, we
examined the use of cross-domain manifold alignment to explain the mapping of state
transitions between pairs of task domains [Bou Ammar et al., 2015a] (see Appendix B for
details). This work revealed a strong correlation between the quality of the manifold alignment
between two task domains and the quality of the transferred knowledge, providing a potential
mechanism for predicting the effectiveness of transfer learning (and thereby avoiding negative
transfer).

Figure 3: Two-layer factorized framework for autonomous cross-domain transfer between different
task domains.
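
The two-layer factorization can be sketched in a few lines: a single shared basis, one projection matrix per domain that maps it to that domain's policy dimension, and task-specific coefficients. All names (`L`, `Psi`, `policy_dims`), dimensions, and random values below are hypothetical; the sketch shows only how the pieces compose, not how they are learned.

```python
import numpy as np

rng = np.random.default_rng(0)
m, k = 6, 3                                   # shared latent dimension, basis size
L = rng.standard_normal((m, k))               # single shared basis across all domains

# One projection matrix per domain; each domain has its own policy dimension.
policy_dims = {"cart_pole": 4, "quadrotor": 10}
Psi = {name: rng.standard_normal((d, m)) for name, d in policy_dims.items()}

def domain_basis(domain):
    """Specialize the shared basis into a domain-specific basis."""
    return Psi[domain] @ L

def reconstruct_policy(domain, s):
    """Policy parameters for a task in `domain` with task-specific coefficients s."""
    return domain_basis(domain) @ s

s = np.array([1.0, 0.0, -0.5])                # sparse task-specific coefficients
theta_cp = reconstruct_policy("cart_pole", s)
theta_quad = reconstruct_policy("quadrotor", s)
```

Note that the two domains produce parameter vectors of different lengths from the same shared basis, which is what lets domains with differing feature and action spaces draw on common knowledge.
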

One major issue with this lifelong learning work is that the learner must gain experience in a new
task before it can perform transfer. To remedy this situation, we also incorporated high-level task
descriptions into the lifelong learning process via coupled dictionary learning [Isele et al.,
2016a]. We showed that these high-level task descriptors can improve lifelong learning
performance. Most importantly, given the high-level task descriptor for a new task, our
approach can synthesize a high-performance controller for that task before gaining any
experience interacting with that task, a process known as zero-shot lifelong transfer learning.
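
The zero-shot step can be pictured as follows, assuming two dictionaries that share the same task codes: one for policy parameters and one for task descriptors. Given only a new task's descriptor, the code is recovered from the descriptor dictionary and then decoded through the policy dictionary. The names `L`, `D`, and `zero_shot_policy` are hypothetical, the dictionaries are random stand-ins for learned ones, and ordinary least squares is used here in place of sparse coding for brevity.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, k = 6, 5, 3                    # policy dim, descriptor dim, number of basis components
L = rng.standard_normal((d, k))      # policy dictionary (hypothetical, assumed already learned)
D = rng.standard_normal((m, k))      # descriptor dictionary coupled to L through shared codes

def zero_shot_policy(descriptor, L, D):
    """Predict policy parameters for an unseen task from its descriptor alone."""
    # Recover the shared code from the descriptor (least squares stands in for sparse coding).
    s, *_ = np.linalg.lstsq(D, descriptor, rcond=None)
    # Decode the same code through the policy dictionary.
    return L @ s

s_new = np.array([0.5, 0.0, -1.0])   # unknown to the learner; used only to build the demo
phi_new = D @ s_new                  # high-level descriptor of the new task
theta_zero_shot = zero_shot_policy(phi_new, L, D)
```

No trajectories from the new task are used anywhere in `zero_shot_policy`; the only task-specific input is the descriptor.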


Quantifying  Negative  Transfer 

Many  existing  methods  for  transfer  learning  rely  exclusively  on  empirical  validation,  lacking 
formal  theoretical  guarantees.  Additionally,  the  majority  of  methods  require  the  availability  of  a 
(near-)  optimal  teacher  in  order  to  help  the  student  learn.  We  have  addressed  these  drawbacks  by 
proposing  a  new  framework  for  policy  advice  [Zhan  et  al.,  2016]  (see  Appendix  E  for  details). 
Our  framework  formally  generalizes  current  single-teacher  advice  models  to  the  multi-teacher 
setting.  Our  novel  algorithm  also  remedies  the  need  for  optimal  teachers  by  exploiting  both 
the  student’s  and  the  teacher’s  knowledge.  Even  if  the  teacher  is  not  optimal,  a  student,  using  our 
algorithm,  is  still  capable  of  acquiring  optimal  behavior  in  a  task;  a  property  not  supported  by 
many  state-of-the-art  methods,  e.g.,  learning  from  demonstration.  We  theoretically  and 
empirically  analyze  the  performance  of  the  proposed  method  and  derive,  for  the  first  time,  regret 
bounds  quantifying  the  successfulness  of  action  advice. 

These contributions can be summarized as:

• formally defining multi-teacher advice models,

• introducing novel algorithms that leverage both teacher and student knowledge,

• deriving a regret analysis showing reduced sample complexities,

• deriving theoretical guarantees for single-teacher advice models, and

• quantifying negative transfer under such advice models.

These theoretical results justify a well-known intuition inherent to advice models: “good teachers
help while bad teachers hurt.” The results show that students can still achieve optimal behavior
when advised by bad teachers; however, they pay an extra cost in learning time or sample
complexity relative to being advised by an optimal teacher.


Automated  Instruction 

In  addition  to  lifelong  and  transfer  learning,  we  also  investigated  using  on-demand  human 
intelligence  to  improve  learning  in  SDM  tasks.  We  designed,  implemented,  and  tested  a  system 
to interface with Amazon’s Mechanical Turk to test whether crowd workers are able to accurately
provide such advice [de la Cruz Jr. et al., 2015] (see Appendix F for details). Videos from a
Pac-Man task were uploaded and the crowd was tested in four distinct scenarios, defined by two
factors. First, workers were asked to identify mistakes either in a real-time mode, where they
had only a single opportunity to view agents’ actions, or in a review mode, where they could
pause and rewind a video. Second, workers were asked either to identify only when the agent made
a mistake (i.e., executed a suboptimal action), or to identify both when the agent made a mistake
and also to suggest an optimal action to execute in its stead. This work is the first research to
establish the crowd’s
ability  to  react  to  mistakes  made  by  an  intelligent  agent  in  real  time,  and  provide  accurate 
guidance  on  a  preferred  alternative  action.  Our  work  informs  the  design  of  future  systems  that 
use  human  intelligence  to  guide  untrained  systems  through  the  learning  process,  without  limiting 
systems  to  only  learn  from  their  mistakes  long  after  they  make  them. 


RESULTS  AND  DISCUSSION 

This section briefly summarizes results from this project. Full details can be found in the cited
(publicly accessible) papers. The most relevant details are reproduced in Appendices A-G.


Lifelong  Reinforcement  Learning 

We  evaluated  our  lifelong  RL  approaches  on  the  control  of  a  variety  of  dynamical  systems.  As  a 
representative  algorithm,  we  first  consider  PG-ELLA.  As  shown  in  Figure  4,  PG-ELLA  provides 
significant  improvement  in  performance  over  learning  tasks  in  isolation  using  standard  policy 
gradients.  As  PG-ELLA  gains  experience  with  more  tasks,  its  performance  improves.  We 
observe  the  same  improvement  in  the  more  challenging  task  of  quadrotor  control  (Figure  5). 

Figure 4: The performance of PG-ELLA vs standard policy gradients (eNAC) on benchmark dynamical systems.


Figure  5:  Performance  of  lifelong  RL  on  quadrotor  control. 


As  shown  in  Figure  6  and  Figure  7,  the  fully-online  safe  lifelong  learner  produces  policies  that 
can  outperform  PG-ELLA  on  the  control  of  dynamical  systems.  Figure  6  (c-d)  and  Figure  8  also 
show  that  the  policies  produced  by  the  learner  always  obey  the  given  safety  constraints. 


(a) Simple Mass (b) Cart Pole (c) Trajectory Simple Mass (d) Trajectory Cart Pole

Figure 6: Results of Safe Online Lifelong RL on benchmark simple mass and cart-pole systems. Figures (a) and (b) depict
performance in lifelong learning scenarios over consecutive unconstrained tasks, showing that our approach outperforms
standard PG and PG-ELLA. Figures (c) and (d) examine the ability of these methods to abide by safety constraints on
sample constrained tasks, depicting two dimensions of the policy space (α1 vs. α2) and demonstrating that our approach
abides by the constraints (the dashed black region).

Figure 7: Performance of Safe Online Lifelong RL on quadrotor control.

Figure 8: Average number of task observations before acquiring policy parameters that abide by the constraints, showing
that Safe Online Lifelong RL immediately projects policies to safe regions.

We showed that incorporating high-level task descriptions improves the performance of multi-task
learning (MTL) and lifelong learning (Figure 9), and that our approach also supports zero-shot
transfer to new tasks given only their high-level task description (Figure 10).

Figure 9: Performance of multi-task (solid lines), lifelong (dashed), and single-task learning (dotted) on benchmark
dynamical systems. The TaDeMTL and TaDeLL algorithms represent MTL and lifelong learning, respectively, with
high-level task descriptors. The rightmost figure compares the run-time performance of lifelong learning with task
descriptors against an alternative (computationally expensive) method for reinforcement learning with task descriptors.

(a) Jumpstart (b) Warm Start: Simple Mass (c) Warm Start: Cart Pole (d) Warm Start: Bicycle

Figure  10:  Zero-shot  transfer  learning  to  new  tasks.  Figure  (a)  shows  the  initial  “jumpstart”  improvement  on  each  task 
domain;  Figures  (b)-(d)  depict  the  result  of  using  zero-shot  policies  as  warm  start  initializations  for  PG. 

In  addition  to  the  simple  simulation  domains  discussed  above,  we  also  investigated  lifelong 
learning on complex simulators and robots [Isele et al., 2016b]. This work explicitly frames
lifelong learning as a problem of disturbance rejection: given that all robots are slightly
different, learning on a new robot should be faster after learning on a number of similar robots.
These experiments show that our methods generalize to real-world domains: lifelong learning
successfully improved learning in simulated turtlebots, simulated AR-Drones, and physical
turtlebots (see Figure 11).


Figure 11: Example setting where simulated turtlebots with different imperfections must learn to reach different target
areas.


To  evaluate  cross-domain  lifelong  learning,  we  interleaved  tasks  from  six  different  domains 
(simple  mass,  double  mass,  cart  pole,  double  cart  pole,  bicycle,  helicopter),  and  then  evaluated 
learning  performance  on  each  task  domain.  Figure  12  shows  that  cross-domain  transfer  provides 
a  substantial  benefit  to  lifelong  learning,  significantly  increasing  performance  above  learning 
(via  PG-ELLA)  on  a  single  domain  of  tasks.  We  also  evaluated  the  performance  of  cross-domain 
transfer  learning  to  a  novel  task  domain  after  having  observed  tasks  from  other  task  domains.  We 
chose  the  most  difficult  task  domain  (helicopter)  as  the  novel  domain.  Figure  13  shows  the 
performance  on  the  helicopter  domain,  after  learning  on  the  other  five  task  domains.  These 
results  demonstrate  that  cross-domain  transfer  provides  a  significant  gain  in  performance  on  the 
novel  task  domain.  Figure  14  shows  the  difference  in  policy  reconstruction  quality  (an  indicator 
of  the  success  of  transfer  learning)  versus  the  Procrustes  measure  (a  measure  of  manifold 
alignment  quality).  The  high  correlation  between  these  two  quantities  suggests  that  it  may  be 
possible  to  examine  the  manifold  alignment  quality  for  the  transitions  underlying  each  domain  in 
order  to  assess  whether  or  not  transfer  will  succeed. 

(a) Double Cart-Pole (b) Helicopter (c) Cart-Pole (d) Jumpstart over PG

Figure 12: Performance of cross-domain lifelong learning after interleaved training over multiple task domains. Figures
(a) and (b) depict task domains where cross-domain transfer has a significant impact, showing that our approach
outperforms standard PG and PG-ELLA. Figure (c) demonstrates that even when a domain benefits less from
cross-domain transfer, our approach still achieves equivalent performance to PG-ELLA. Figure (d) depicts the average
improvement in initial task performance over PG from transfer.


Figure 13: Performance of cross-domain transfer on a novel task domain (helicopter) after lifelong learning on other
domains.


Figure  14:  Correlation  between  manifold  alignment  quality  (Procrustes  metric)  and  quality  of  the  transferred  knowledge 
for  cross-domain  transfer  between  simple  mass  (SM),  cart  pole  (CP),  and  three-link  cart  pole  (3CP)  systems. 
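
For readers unfamiliar with the Procrustes measure, the sketch below computes a standard Procrustes disparity between two point sets whose rows correspond: both sets are centered and scaled to unit norm, and the disparity is what remains after the best rotation, reflection, and scaling. A value near 0 means the sets align almost perfectly. The function name and the synthetic data are assumptions for illustration; how the state-transition samples are embedded and paired in the cited work is not reproduced here.

```python
import numpy as np

def procrustes_disparity(X, Y):
    """Procrustes disparity between two point sets with matching rows (0 = perfect alignment)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    X = X / np.linalg.norm(X)
    Y = Y / np.linalg.norm(Y)
    # Singular values of X^T Y give the best overlap achievable by rotation/reflection and scaling.
    sigma = np.linalg.svd(X.T @ Y, compute_uv=False)
    return 1.0 - sigma.sum() ** 2

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 2))                    # hypothetical samples from one domain
c, s_ = np.cos(0.7), np.sin(0.7)
R = np.array([[c, -s_], [s_, c]])                   # rotation by 0.7 radians
Y_aligned = 3.0 * X @ R + np.array([1.0, -2.0])     # rotated, scaled, shifted copy of X
Y_unrelated = rng.standard_normal((50, 2))          # unrelated samples

d_aligned = procrustes_disparity(X, Y_aligned)      # close to 0
d_unrelated = procrustes_disparity(X, Y_unrelated)  # between 0 and 1
```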


Quantifying  Negative  Transfer 

In our recent paper [Zhan et al., 2016] we show that it is possible to quantify negative transfer
when using a class of advice algorithms. In particular, we introduce a regret ratio that defines
how good a teacher is, and then use this ratio to bound how much this teacher can help a student
learn. In addition to the theoretical work, a set of experiments shows that our algorithms work in
practice with a variety of teachers (see Figure 15). This work opens the possibility of future
studies with other classes of advice algorithms, and allows us to better understand how and
why negative transfer occurs.

The results of this work are as follows. First, we quantify the occurrence of negative transfer in
action advice models, shedding light on the failure modes of these methods. Second, we show
that high-quality transfer knowledge may still cause negative transfer when the target algorithm
is able to outperform the source knowledge. Third, deciding in advance whether to transfer
expert knowledge is important, because evaluating the transferred knowledge is expensive (it is
equivalent to evaluating the teacher policy in the target MDP).


Figure 15: Experiments on “Combination Lock,” a simple benchmark task (Left), and “Block Dude,” a complex
sequential decision task (Right), show that our method can benefit not only from an optimal teacher, but also from a
random teacher and a teacher that always suggests the worst actions.


Automated  Instruction 

To generate the four videos used in the user study, we recorded Pac-Man (see Figure 16, Left) being
controlled by a human who intentionally made different types of mistakes. From each recording, we
then selected a 10-14 second clip that contained one (and only one) mistake. We then created 16
Human Intelligence Tasks (HITs) on Amazon’s Mechanical Turk (4 videos in each of the 4 settings).
Each of the 16 HITs was performed by 30 unique workers.


Figure 16: LEFT: This screenshot shows the web interface of the user study with game layout and components of the
Pac-Man game: 1) Pac-Man, 2) 4 ghosts, 3) Pills, and 4) Power Pills. RIGHT: A histogram of the distribution of workers’
suggestions.


Results from this user study supported multiple hypotheses. First, the crowd can collectively
identify the correct point at which an error occurs with over 91% accuracy (see Figure 16,
Right). Second, not only can this mistake identification be done in real time (e.g., as would be
done with a live video of a robot learning) with a mean latency of just 0.39 seconds, but workers
are also able to successfully identify what the optimal move should have been. Third, we compared
the crowd’s performance in this real-time setting with an offline “review” setting where game
playback can be controlled and replayed. If additional time is available (e.g., a video of a robot
performing a task that can be watched multiple times), mistakes can be located more precisely:
our data show a mean latency of 0.15 seconds.
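
One simple way to turn many workers' reports into a single collective estimate is to bin the reported timestamps and take the most-reported bin. The sketch below does this with the standard library only. The bin width, the tie-breaking rule, the function name, and the example reports are all assumptions for illustration; the aggregation actually used in the study is described in the cited paper and is not reproduced here.

```python
from collections import Counter

def crowd_mistake_time(timestamps, bin_width=0.5):
    """Return the center (in seconds) of the time bin reported by the most workers."""
    if not timestamps:
        raise ValueError("need at least one report")
    bins = Counter(int(t // bin_width) for t in timestamps)
    # Prefer the bin with the most reports; break ties in favor of the earlier bin.
    best_bin, _ = max(bins.items(), key=lambda kv: (kv[1], -kv[0]))
    return (best_bin + 0.5) * bin_width

reports = [4.1, 4.3, 4.2, 4.4, 7.9, 4.0, 1.2]   # hypothetical worker reports, in seconds
estimate = crowd_mistake_time(reports)          # → 4.25
```

Taking the mode rather than the mean keeps a few stray reports (here 1.2 s and 7.9 s) from shifting the estimate.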

The  contributions  in  this  research  are  to: 

•  Present  and  formalize  the  idea  that  on-line  crowds  can  provide  assistance  to  learning 
agents  in  real-time,  and  as  the  need  arises,  to  improve  performance. 

•  Demonstrate  that  crowd  workers  from  readily  accessible  platforms  such  as  Amazon 
Mechanical  Turk  can  respond  quickly  and  accurately  enough  to  provide  just-in-time 
feedback. 

•  Show  that  workers  can  also  improve  their  accuracy  in  post  hoc  review  settings  to  yield 
better  performance  in  future  situations. 

•  Discuss  the  types  and  sources  of  errors  from  the  crowd  observed  in  the  16  studies  that 
will  be  critical  for  successful  integration  of  crowd  feedback  and  reinforcement  learning. 

Future work will include fully integrating knowledge from the crowd with autonomous learning.
We expect such knowledge will allow agents to learn SDM tasks more quickly, with relatively
little overhead.


CONCLUSION 

This work has substantially advanced the field of lifelong learning for sequential decision tasks,
resulting in more than 10 peer-reviewed papers, including one that was nominated for a best
paper award at IJCAI’15. We have developed and improved multiple algorithms that allow
simulated and physical robots to learn a sequence of tasks, continually accumulating knowledge
while improving performance on past tasks. This sequence can be formed from a set of tasks
from the same domain, or from multiple domains (i.e., cross-domain transfer).
Additionally, human knowledge can be collected from non-expert users, which can potentially
improve the performance both of transfer learning and of base learning algorithms.

While  this  work  has  been  largely  successful,  it  has  also  raised  multiple  exciting  open  questions 
for  future  work. 

•  How  can  lifelong  learning  methods  be  extended  beyond  linear  parametric  models  to  other 
types  of  base  learning  algorithms  (e.g.,  kernel  methods,  deep  learning,  etc.)  to  produce 
more  powerful  lifelong  learning  algorithms? 

•  How  and  under  what  conditions  can  we  autonomously  transfer  knowledge  between 
diverse  tasks,  including  across  agents  situated  in  radically  different  environments? 

• Can the use of hierarchical knowledge in lifelong learning allow us to scale to larger,
more complex problems?

•  How  can  we  incorporate  high-level  task  descriptions  or  partial  information  into  transfer 
learning  to  synthesize  models  for  novel,  complex  tasks? 

•  How  can  safe  reinforcement  learning  best  be  applied  to  physical  robots  in  order  to 
guarantee  policies  that  obey  required  constraints? 

•  Can  negative  transfer  be  quickly  identified  and  avoided? 

•  How  can  human  advice  best  be  incorporated  into  the  lifelong  learning  process? 

The following appendices provide additional technical detail regarding the most salient
contributions referenced in the above final report.


Appendix  A:  Ruvolo  and  Eaton,  2013a 


ELLA:  An  Efficient  Lifelong  Learning  Algorithm 

Abstract

The problem of learning multiple consecutive tasks, known as lifelong learning, is of great
importance to the creation of intelligent, general-purpose, and flexible machines. In this paper,
we develop a method for online multi-task learning in the lifelong learning setting. The proposed
Efficient Lifelong Learning Algorithm (ELLA) maintains a sparsely shared basis for all task
models, transfers knowledge from the basis to learn each new task, and refines the basis over time
to maximize performance across all tasks. We show that ELLA has strong connections to both online
dictionary learning for sparse coding and state-of-the-art batch multi-task learning methods, and
provide robust theoretical performance guarantees. We show empirically that ELLA yields nearly
identical performance to batch multi-task learning while learning tasks sequentially in three
orders of magnitude (over 1,000x) less time.

1.  Introduction 

Versatile learning systems must be capable of efficiently and continually acquiring
knowledge over a series of prediction tasks. In such a lifelong learning setting, the
agent receives tasks sequentially. At any time, the agent may be asked to solve a problem
from any previous task, and so must maximize its performance across all learned tasks at
each step. When the solutions to these tasks are related through some underlying
structure, the agent may share knowledge between tasks to improve learning performance, as
explored in both transfer and multi-task learning.

Despite this commonality, current algorithms for transfer and multi-task learning are
insufficient for lifelong learning. Transfer learning focuses on efficiently modeling a
new target task by leveraging solutions to previously learned source tasks, without
considering potential improvements to the source task models. In contrast, multi-task
learning (MTL) focuses on maximizing performance across all tasks through shared
knowledge, at potentially high computational cost. Lifelong learning includes elements of
both paradigms, focusing on efficiently learning each consecutive task by building upon
previous knowledge while optimizing performance across all tasks. In particular, lifelong
learning incorporates the notion of reverse transfer, in which learning subsequent tasks
can improve the performance of previously learned task models. Lifelong learning could
also be considered as online MTL.

In this paper, we develop an Efficient Lifelong Learning Algorithm (ELLA) that
incorporates aspects of both transfer and multi-task learning. ELLA learns and maintains a
library of latent model components as a shared basis for all task models, supporting soft
task grouping and overlap (Kumar & Daumé III, 2012). As each new task arrives, ELLA
transfers knowledge through the shared basis to learn the new model, and refines the basis
with knowledge from the new task. By refining the basis over time, newly acquired
knowledge is integrated into existing basis vectors, thereby improving previously learned
task models. This process is computationally efficient, and we provide robust theoretical
guarantees on ELLA's performance and convergence. We evaluate ELLA on three challenging
multi-task data sets: land mine detection, facial expression recognition, and student exam
score prediction. Our results show that ELLA achieves nearly identical performance to
batch MTL with three orders of magnitude (over 1,000x) speedup in learning time. We also
compare ELLA to a current method for online MTL (Saha et al., 2011), and find that ELLA
has both lower computational cost and higher performance.

2. Related Work

Early work on lifelong learning focused on sharing distance metrics using task clustering
(Thrun & O'Sullivan, 1996), and transferring invariances in neural networks (Thrun, 1996).
Lifelong learning has also been explored for reinforcement learning (Ring, 1997; Sutton
et al., 2007) and learning by reading (Carlson et al., 2010). In contrast, ELLA is a
general algorithm that supports different base learners to learn continually, framed in
the context of current MTL methods.

Recently, MTL research has considered the use of a shared basis for all task models to
improve learning over a set of tasks. Several formulations of this idea have been
proposed, including a probabilistic framework (Zhang et al., 2008) and a non-parametric
Bayesian method that automatically selects the number of bases (Rai & Daumé III, 2010).
These methods assume that each model is represented as a parameter vector that is a linear
combination of these bases. By using a common basis, these approaches share information
between learning tasks and account for task relatedness as the models are learned in
tandem with the basis. The GO-MTL algorithm (Kumar & Daumé III, 2012) also uses a sparsely
shared basis for multi-task learning, with the advantage that it automatically learns
(potentially) overlapping groups of tasks to maximize knowledge transfer. We employ this
rich model of underlying task structure as the starting point for developing ELLA.

Few papers have focused on the development of very computationally efficient methods for
MTL. Simm et al. (2011) present a model for learning multiple tasks that is efficient in
the case when the number of tasks is very large. However, their approach suffers from
significant drawbacks in comparison with ELLA: (1) their approach is not an online
algorithm, limiting its use in the lifelong learning setting, and (2) their underlying
model of shared task structure is significantly less flexible than our model. Another
approach, OMTL by Saha et al. (2011), is designed to provide efficient performance when
instances and new tasks arrive incrementally. However, OMTL is only applicable to
classification tasks (not regression) and relies on perceptron learning, which we found to
perform poorly in comparison to other base learners (see Section 4).

3.  Approach 

We begin by describing the lifelong learning problem and why lifelong learning algorithms
must be able to efficiently learn new tasks and incorporate new training data from
previous tasks. Then, we introduce ELLA, which efficiently handles both of these
operations through a two-stage optimization procedure. We show that ELLA encompasses the
problem of online dictionary learning for sparse coding as a special case, and finish by
proving robust convergence guarantees.


Figure 1. An illustration of the lifelong learning process (legible panel labels:
previously learned tasks, future learning tasks).


This paper uses the following conventions: matrices are denoted by bold uppercase letters,
vectors are denoted by bold lowercase letters, scalars are denoted by normal lowercase
letters, and sets are denoted using script typeface (e.g., $\mathcal{A}$). Parenthetical
superscripts denote quantities related to a particular task (e.g., matrix
$\mathbf{A}^{(t)}$ and vector $\mathbf{v}^{(t)}$ are related to task $t$).


3.1.  The  Lifelong  Learning  Problem 

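A lifelong learning agent (Figure 1) faces a series of supervised learning tasks
$\mathcal{Z}^{(1)}, \mathcal{Z}^{(2)}, \ldots, \mathcal{Z}^{(T_{\max})}$, where each task
$\mathcal{Z}^{(t)} = (\hat{f}^{(t)}, \mathbf{X}^{(t)}, \mathbf{y}^{(t)})$ is defined by a
(hidden) mapping $\hat{f}^{(t)} : \mathcal{X}^{(t)} \mapsto \mathcal{Y}^{(t)}$ from an
instance space $\mathcal{X}^{(t)} \subseteq \mathbb{R}^{d}$ to a set of labels
$\mathcal{Y}^{(t)}$ (typically $\mathcal{Y}^{(t)} = \{-1, +1\}$ for classification tasks
and $\mathcal{Y}^{(t)} = \mathbb{R}$ for regression tasks). Each task $t$ has $n_t$
training instances $\mathbf{X}^{(t)} \in \mathbb{R}^{d \times n_t}$ with corresponding
labels $\mathbf{y}^{(t)} \in (\mathcal{Y}^{(t)})^{n_t}$ given by $\hat{f}^{(t)}$. We assume
that a priori, the learner does not know the total number of tasks $T_{\max}$, the
distribution of these tasks, or their order.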

Each time step, the agent receives a batch of labeled training data for some task $t$,
either a new task or a previously learned task. Let $T$ denote the number of tasks
encountered so far. After receiving each batch of data, the agent may be asked to make
predictions on instances of any previous task. Its goal is to construct task models
$f^{(1)}, \ldots, f^{(T)}$, where each $f^{(t)} : \mathbb{R}^{d} \mapsto \mathcal{Y}^{(t)}$,
such that: (1) each $f^{(t)}$ will approximate $\hat{f}^{(t)}$ to enable the accurate
prediction of labels for new instances, (2) each $f^{(t)}$ can be rapidly updated as the
agent encounters additional training data for known tasks, and (3) new $f^{(t)}$'s can be
added efficiently as the agent encounters new tasks. We assume that the total numbers of
tasks $T_{\max}$ and data instances $\sum_{t=1}^{T_{\max}} n_t$ will be large, and so a
lifelong learning algorithm must have a computational complexity to update the models that
scales favorably with both quantities.


Approved  for  Public  Release;  Distribution  Unlimited. 


3.2. Task Structure Model for ELLA


ELLA takes a parametric approach to lifelong learning in which the prediction function
$f^{(t)}(\mathbf{x}) = f(\mathbf{x}; \boldsymbol{\theta}^{(t)})$ for each task $t$ is
governed by the task-specific parameter vector $\boldsymbol{\theta}^{(t)} \in
\mathbb{R}^{d}$. To model the relationships between tasks, we assume that the parameter
vectors can be represented using a linear combination of shared latent model components
from a knowledge repository. Many recent MTL methods employ this same technique of using a
shared basis as a means to transfer knowledge between learning problems (see Section 2).

Our model of latent task structure is based on the GO-MTL model proposed by Kumar &
Daumé III (2012). ELLA maintains a library of $k$ latent model components $\mathbf{L} \in
\mathbb{R}^{d \times k}$ shared between tasks. Each task parameter vector
$\boldsymbol{\theta}^{(t)}$ can be represented as a linear combination of the columns of
$\mathbf{L}$ according to the weight vector $\mathbf{s}^{(t)} \in \mathbb{R}^{k}$ (i.e.,
$\boldsymbol{\theta}^{(t)} = \mathbf{L}\mathbf{s}^{(t)}$). We encourage the
$\mathbf{s}^{(t)}$'s to be sparse (i.e., use few latent components) in order to ensure
that each learned model component captures a maximal reusable chunk of knowledge.

Given  the  labeled  training  data  for  each  task,  we  opti¬ 
mize  the  models  to  minimize  the  predictive  loss  over  all 
tasks  while  encouraging  the  models  to  share  structure. 
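This problem is realized by the objective function:

$$e_T(\mathbf{L}) = \frac{1}{T} \sum_{t=1}^{T} \min_{\mathbf{s}^{(t)}} \left\{ \frac{1}{n_t} \sum_{i=1}^{n_t} \mathcal{L}\left( f\left(\mathbf{x}_i^{(t)}; \mathbf{L}\mathbf{s}^{(t)}\right), y_i^{(t)} \right) + \mu \left\|\mathbf{s}^{(t)}\right\|_1 \right\} + \lambda \left\|\mathbf{L}\right\|_F^2 \qquad (1)$$

where $(\mathbf{x}_i^{(t)}, y_i^{(t)})$ is the $i$th labeled training instance for task
$t$, $\mathcal{L}$ is a known loss function, and the $L_1$ norm of $\mathbf{s}^{(t)}$ is
used as a convex approximation to the true vector sparsity. This is similar to the model
used in GO-MTL, with the modification that we average the model losses on the training
data across tasks (giving rise to the $\frac{1}{n_t}$ term). This modification is crucial
for obtaining the convergence guarantees in Section 3.6.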

Since Equation 1 is not jointly convex in $\mathbf{L}$ and the $\mathbf{s}^{(t)}$'s, our
goal will be to develop a procedure to arrive at a local optimum of the objective
function. A common approach for computing a local optimum for objective functions of this
type is to alternately perform two convex optimization steps: one in which $\mathbf{L}$ is
optimized while holding the $\mathbf{s}^{(t)}$'s fixed, and another in which the
$\mathbf{s}^{(t)}$'s are optimized while holding $\mathbf{L}$ fixed. These two steps are
then repeated until convergence (this is the approach employed for model optimization in
GO-MTL). Next, we discuss two reasons why this approach is inefficient and thus
inapplicable to lifelong learning with many tasks and data instances.


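The first inefficiency arises due to the explicit dependence of Equation 1 on all of the
previous training data (through the inner summation). We remove this inefficiency by
approximating Equation 1 using the second-order Taylor expansion of
$\frac{1}{n_t} \sum_{i=1}^{n_t} \mathcal{L}\left( f\left(\mathbf{x}_i^{(t)}; \boldsymbol{\theta}\right), y_i^{(t)} \right)$
around $\boldsymbol{\theta} = \boldsymbol{\theta}^{(t)}$, where
$\boldsymbol{\theta}^{(t)} = \arg\min_{\boldsymbol{\theta}} \frac{1}{n_t} \sum_{i=1}^{n_t} \mathcal{L}\left( f\left(\mathbf{x}_i^{(t)}; \boldsymbol{\theta}\right), y_i^{(t)} \right)$
(that is, $\boldsymbol{\theta}^{(t)}$ is an optimal predictor learned on only the training
data for task $t$). Plugging the second-order Taylor expansion into Equation 1 yields:

$$g_T(\mathbf{L}) = \frac{1}{T} \sum_{t=1}^{T} \min_{\mathbf{s}^{(t)}} \left\{ \left\|\boldsymbol{\theta}^{(t)} - \mathbf{L}\mathbf{s}^{(t)}\right\|_{\mathbf{D}^{(t)}}^2 + \mu \left\|\mathbf{s}^{(t)}\right\|_1 \right\} + \lambda \left\|\mathbf{L}\right\|_F^2 \qquad (2)$$

where

$$\mathbf{D}^{(t)} = \frac{1}{2} \nabla^2_{\boldsymbol{\theta},\boldsymbol{\theta}} \, \frac{1}{n_t} \sum_{i=1}^{n_t} \mathcal{L}\left( f\left(\mathbf{x}_i^{(t)}; \boldsymbol{\theta}\right), y_i^{(t)} \right) \Bigg|_{\boldsymbol{\theta} = \boldsymbol{\theta}^{(t)}}$$

$$\boldsymbol{\theta}^{(t)} = \arg\min_{\boldsymbol{\theta}} \frac{1}{n_t} \sum_{i=1}^{n_t} \mathcal{L}\left( f\left(\mathbf{x}_i^{(t)}; \boldsymbol{\theta}\right), y_i^{(t)} \right),$$

and $\|\mathbf{v}\|_{\mathbf{A}}^2 = \mathbf{v}^{\top}\mathbf{A}\mathbf{v}$. In
Equation 2, we have suppressed the constant term of the Taylor expansion (since it does
not affect the minimum) and there is no linear term (since by construction
$\boldsymbol{\theta}^{(t)}$ is a minimizer). Crucially, we have removed the dependence of
the optimization on the number of data instances $n_1 \ldots n_T$ in each task. The
approximation is exact in an important special case: when the model is linear and the
loss function is squared loss (see Section 3.4.1).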
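
The exactness claim for the linear, squared-loss case can be checked numerically. The sketch below is illustrative only: it assumes the squared loss carries a factor of one half, so that half of its Hessian equals $\frac{1}{2 n_t}\mathbf{X}^{(t)}\mathbf{X}^{(t)\top}$, and the function names, data sizes, and random data are hypothetical choices made for this example.

```python
import numpy as np

def fit_linear_task(X, y):
    """Least-squares fit for X of shape (d, n_t) and labels y of shape (n_t,).

    Returns theta = (X X^T)^{-1} X y and D = X X^T / (2 n_t).
    Assumes X X^T is full rank.
    """
    n_t = X.shape[1]
    gram = X @ X.T
    theta = np.linalg.solve(gram, X @ y)
    D = gram / (2.0 * n_t)
    return theta, D

def mean_half_squared_loss(theta, X, y):
    """Average of 0.5 * (theta^T x_i - y_i)^2 (the 0.5 factor is an assumption)."""
    residual = X.T @ theta - y
    return 0.5 * np.mean(residual ** 2)

# Hypothetical synthetic task.
rng = np.random.default_rng(0)
d, n_t = 4, 50
X = rng.normal(size=(d, n_t))
y = X.T @ rng.normal(size=d) + 0.1 * rng.normal(size=n_t)
theta_star, D = fit_linear_task(X, y)

# Compare the true increase in loss with the quadratic term ||theta - theta*||_D^2.
theta_other = theta_star + rng.normal(size=d)
diff = theta_other - theta_star
exact_gap = (mean_half_squared_loss(theta_other, X, y)
             - mean_half_squared_loss(theta_star, X, y))
quadratic_term = diff @ D @ diff
```

Because the loss is exactly quadratic in $\boldsymbol{\theta}$ and the gradient vanishes at the single-task optimum, `exact_gap` and `quadratic_term` agree up to floating-point error.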

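The second inefficiency of Equation 1 is that in order to evaluate a single candidate
$\mathbf{L}$, an optimization problem must be solved to recompute the value of each of the
$\mathbf{s}^{(t)}$'s (which will become increasingly expensive as the number of tasks
learned $T$ increases). To overcome this problem, we modify the formulation in Equation 2
to remove the minimization over $\mathbf{s}^{(t)}$. We accomplish this by computing each
of the $\mathbf{s}^{(t)}$'s when the training data for task $t$ is last encountered, and
not updating them when training on other tasks. At first glance this might seem to prevent
previously learned tasks from benefiting from training on later tasks (which we call
reverse transfer); however, these tasks can benefit from subsequent modifications to
$\mathbf{L}$. Later in Section 3.6, we show that this choice to update $\mathbf{s}^{(t)}$
only when we encounter training data for the respective task does not significantly affect
the quality of model fit to the data as the number of tasks grows large. Using the
previously computed values of $\mathbf{s}^{(t)}$ gives rise to the following optimization
procedure (where we use the notation $\mathbf{L}_m$ to refer to the value of the latent
components at the start of the $m$th iteration, and $t$ is assumed to correspond to the
particular task for which we just received training data):

$$\mathbf{s}^{(t)} \leftarrow \arg\min_{\mathbf{s}^{(t)}} \ell\left(\mathbf{L}_m, \mathbf{s}^{(t)}, \boldsymbol{\theta}^{(t)}, \mathbf{D}^{(t)}\right) \qquad (3)$$

$$\mathbf{L}_{m+1} \leftarrow \arg\min_{\mathbf{L}} \hat{g}_m(\mathbf{L}) \qquad (4)$$

$$\hat{g}_m(\mathbf{L}) = \lambda \left\|\mathbf{L}\right\|_F^2 + \frac{1}{T} \sum_{t=1}^{T} \ell\left(\mathbf{L}, \mathbf{s}^{(t)}, \boldsymbol{\theta}^{(t)}, \mathbf{D}^{(t)}\right) \qquad (5)$$

where

$$\ell\left(\mathbf{L}, \mathbf{s}, \boldsymbol{\theta}, \mathbf{D}\right) = \mu \left\|\mathbf{s}\right\|_1 + \left\|\boldsymbol{\theta} - \mathbf{L}\mathbf{s}\right\|_{\mathbf{D}}^2 . \qquad (6)$$

Algorithm 1 ELLA ($k$, $d$, $\lambda$, $\mu$)

    $T \leftarrow 0$, $\mathbf{A} \leftarrow \mathrm{zeros}_{k \times d,\, k \times d}$,
    $\mathbf{b} \leftarrow \mathrm{zeros}_{k \times d,\, 1}$, $\mathbf{L} \leftarrow \mathrm{zeros}_{d,\, k}$
    while isMoreTrainingDataAvailable() do
        $(\mathbf{X}_{new}, \mathbf{y}_{new}, t) \leftarrow$ getNextTrainingData()
        if isNewTask($t$) then
            $T \leftarrow T + 1$
            $\mathbf{X}^{(t)} \leftarrow \mathbf{X}_{new}$, $\mathbf{y}^{(t)} \leftarrow \mathbf{y}_{new}$
        else
            $\mathbf{A} \leftarrow \mathbf{A} - \left(\mathbf{s}^{(t)}\mathbf{s}^{(t)\top}\right) \otimes \mathbf{D}^{(t)}$
            $\mathbf{b} \leftarrow \mathbf{b} - \mathrm{vec}\left(\mathbf{s}^{(t)\top} \otimes \left(\boldsymbol{\theta}^{(t)\top}\mathbf{D}^{(t)}\right)\right)$
            $\mathbf{X}^{(t)} \leftarrow \left[\mathbf{X}^{(t)} \ \mathbf{X}_{new}\right]$, $\mathbf{y}^{(t)} \leftarrow \left[\mathbf{y}^{(t)}; \mathbf{y}_{new}\right]$
        end if
        $\left(\boldsymbol{\theta}^{(t)}, \mathbf{D}^{(t)}\right) \leftarrow$ singleTaskLearner($\mathbf{X}^{(t)}, \mathbf{y}^{(t)}$)
        $\mathbf{L} \leftarrow$ reinitializeAllZeroColumns($\mathbf{L}$)
        $\mathbf{s}^{(t)} \leftarrow$ Equation 3
        $\mathbf{A} \leftarrow \mathbf{A} + \left(\mathbf{s}^{(t)}\mathbf{s}^{(t)\top}\right) \otimes \mathbf{D}^{(t)}$
        $\mathbf{b} \leftarrow \mathbf{b} + \mathrm{vec}\left(\mathbf{s}^{(t)\top} \otimes \left(\boldsymbol{\theta}^{(t)\top}\mathbf{D}^{(t)}\right)\right)$
        $\mathbf{L} \leftarrow \mathrm{mat}\left(\left(\frac{1}{T}\mathbf{A} + \lambda \mathbf{I}_{k \times d,\, k \times d}\right)^{-1} \frac{1}{T}\mathbf{b}\right)$
    end while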

Next,  we  present  the  specific  steps  needed  to  perform 
the  updates  in  the  preceding  equations. 

3.3.  Model  Update  for  ELLA 

Suppose that at the $m$th iteration we receive training data for task $t$. We must perform
two steps to update our model: compute $\mathbf{s}^{(t)}$ and update $\mathbf{L}$. In
order to compute $\mathbf{s}^{(t)}$, we first compute an optimal model
$\boldsymbol{\theta}^{(t)}$ using only the data from task $t$. The details of this step
will depend on the form of the model and loss function under consideration, and thus here
we treat it as a black box. If the training data for a particular task arrive interleaved
with other tasks and not in a single batch, it may be important to use an online
single-task learning algorithm to achieve maximum scalability.

Once $\boldsymbol{\theta}^{(t)}$ has been computed, we next compute $\mathbf{D}^{(t)}$
(which is model-dependent) and re-initialize (either randomly or to one of the
$\boldsymbol{\theta}^{(t)}$'s) any columns of $\mathbf{L}$ that are all-zero (which will
occur if a particular latent component is currently unused). We then compute
$\mathbf{s}^{(t)}$ using the current basis $\mathbf{L}_m$ by solving an
$L_1$-regularized regression problem, which is an instance of the Lasso.
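
One way to carry out this step is sketched below. It rewrites $\|\boldsymbol{\theta} - \mathbf{L}\mathbf{s}\|_{\mathbf{D}}^2$ as an ordinary least-squares term using the matrix square root of $\mathbf{D}$ and then runs a basic proximal-gradient (ISTA) loop. ISTA is a stand-in chosen for self-containment, not necessarily the solver the authors used, and the function names and iteration count are hypothetical.

```python
import numpy as np

def matrix_sqrt_psd(D):
    """Symmetric square root of a positive semi-definite matrix via eigendecomposition."""
    eigvals, eigvecs = np.linalg.eigh(D)
    eigvals = np.clip(eigvals, 0.0, None)  # guard against tiny negative round-off
    return (eigvecs * np.sqrt(eigvals)) @ eigvecs.T

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def sparse_code(L, theta, D, mu, n_iter=500):
    """Approximately minimize mu * ||s||_1 + ||theta - L s||_D^2 over s with ISTA."""
    root = matrix_sqrt_psd(D)
    design = root @ L      # design matrix of the equivalent Lasso problem
    target = root @ theta  # target vector of the equivalent Lasso problem
    lipschitz = 2.0 * np.linalg.norm(design, 2) ** 2  # gradient Lipschitz constant
    s = np.zeros(L.shape[1])
    if lipschitz == 0.0:
        return s
    step = 1.0 / lipschitz
    for _ in range(n_iter):
        grad = 2.0 * design.T @ (design @ s - target)
        s = soft_threshold(s - step * grad, mu * step)
    return s

# With L = I and D = I the minimizer is a soft-thresholding of theta by mu / 2.
theta_demo = np.array([3.0, -0.2, 1.0])
s_demo = sparse_code(np.eye(3), theta_demo, np.eye(3), mu=1.0)
```

In the identity case the small middle coefficient is driven exactly to zero, which is the sparsity effect the $L_1$ penalty is meant to produce.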

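To update $\mathbf{L}$, we null the gradient of Equation 5 and solve for $\mathbf{L}$.
This procedure yields the updated column-wise vectorization of $\mathbf{L}$ as
$\mathbf{A}^{-1}\mathbf{b}$, where:

$$\mathbf{A} = \lambda \mathbf{I}_{d \times k,\, d \times k} + \frac{1}{T} \sum_{t=1}^{T} \left(\mathbf{s}^{(t)}\mathbf{s}^{(t)\top}\right) \otimes \mathbf{D}^{(t)} \qquad (7)$$

$$\mathbf{b} = \frac{1}{T} \sum_{t=1}^{T} \mathrm{vec}\left(\mathbf{s}^{(t)\top} \otimes \left(\boldsymbol{\theta}^{(t)\top}\mathbf{D}^{(t)}\right)\right) . \qquad (8)$$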

To  avoid  having  to  sum  over  all  tasks  to  compute  A 
and  b  at  each  step,  we  construct  A  incrementally  as 
new  tasks  arrive  (see  Algorithm  1  for  details). 
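
The incremental construction can be sketched as follows. The code keeps unnormalized running sums for $\mathbf{A}$ and $\mathbf{b}$, applies the $\frac{1}{T}$ scaling and the $\lambda\mathbf{I}$ term only at solve time, and uses a dense linear solve rather than the more efficient low-rank update discussed in the text. It assumes each $\mathbf{D}^{(t)}$ is symmetric; the function names and random test data are hypothetical.

```python
import numpy as np

def accumulate_task(A, b, s, theta, D, sign=1.0):
    """Add (sign=+1) or remove (sign=-1) one task's contribution to the running sums."""
    A = A + sign * np.kron(np.outer(s, s), D)
    b = b + sign * np.kron(s, D @ theta)  # column-major vec of D theta s^T
    return A, b

def solve_basis(A, b, T, lam, d, k):
    """Solve (A / T + lam * I) vec(L) = b / T and reshape column-wise into L."""
    system = A / T + lam * np.eye(d * k)
    vec_L = np.linalg.solve(system, b / T)
    return vec_L.reshape((d, k), order="F")

# Hypothetical random tasks used only to check the stationarity condition.
rng = np.random.default_rng(1)
d, k, T, lam = 3, 2, 5, 0.1
A, b = np.zeros((d * k, d * k)), np.zeros(d * k)
tasks = []
for _ in range(T):
    M = rng.normal(size=(d, d))
    D = M @ M.T / d              # symmetric positive semi-definite
    s = rng.normal(size=k)
    theta = rng.normal(size=d)
    A, b = accumulate_task(A, b, s, theta, D)
    tasks.append((s, theta, D))
L = solve_basis(A, b, T, lam, d, k)

# Half the gradient of Equation 5 with respect to L; it should vanish at the solution.
grad = lam * L + sum(D @ (L @ s - theta)[:, None] @ s[None, :]
                     for s, theta, D in tasks) / T
```

Calling `accumulate_task` with `sign=-1.0` before re-adding a task mirrors the subtraction branch of Algorithm 1 for a task that is revisited with new data.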

Computational Complexity: Each update begins by running a single-task learner to compute
$\boldsymbol{\theta}^{(t)}$ and $\mathbf{D}^{(t)}$; we assume that this step has
complexity $O(\xi(d, n_t))$. Next, to update $\mathbf{s}^{(t)}$ requires solving an
instance of the Lasso, which has complexity $O(nd \min(n, d))$, where $d$ is the
dimensionality and $n$ is the number of data instances. Equation 3 can be seen as an
instance of the Lasso in $k$ dimensions with $d$ data instances, for a total complexity of
$O(dk^2)$. However, to formulate the Lasso problem requires computing the
eigendecomposition of $\mathbf{D}^{(t)}$, which takes $O(d^3)$, and multiplying the matrix
square root of $\mathbf{D}^{(t)}$ by $\mathbf{L}$, which takes $O(kd^2)$. Therefore,
updating $\mathbf{s}^{(t)}$ takes time $O(d^3 + kd^2 + dk^2)$. A straightforward algorithm
for updating $\mathbf{L}$ involves inverting a $(d \times k)$-by-$(d \times k)$ matrix,
which has complexity $O(d^3k^3)$. However, we can exploit the fact that the updates to
$\mathbf{A}$ are low-rank to derive a more efficient algorithm with complexity
$O(d^3k^2)$ based on a recursive method (Yu, 1991) for updating the eigendecomposition of
$\mathbf{A}$ (see Online Appendix). Therefore, using this more advanced approach, the
overall complexity of each ELLA update is $O(k^2d^3 + \xi(d, n_t))$.

3.4.  Base  Learning  Algorithms 

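Next, we show how two popular single-task learning algorithms can be used as the base
learner for ELLA.

3.4.1. Linear Regression

In this setting $\mathbf{y}^{(t)} \in \mathbb{R}^{n_t}$,
$f(\mathbf{x}; \boldsymbol{\theta}) = \boldsymbol{\theta}^{\top}\mathbf{x}$, and
$\mathcal{L}$ is the squared-loss function. To apply ELLA, we compute the optimal
single-task model $\boldsymbol{\theta}^{(t)}$, which is available in closed form as
$\boldsymbol{\theta}^{(t)} = \left(\mathbf{X}^{(t)}\mathbf{X}^{(t)\top}\right)^{-1}\mathbf{X}^{(t)}\mathbf{y}^{(t)}$
(assuming that $\mathbf{X}^{(t)}\mathbf{X}^{(t)\top}$ is full-rank). $\mathbf{D}^{(t)}$ is
also available in closed form as
$\mathbf{D}^{(t)} = \frac{1}{2n_t}\mathbf{X}^{(t)}\mathbf{X}^{(t)\top}$. Given
$\boldsymbol{\theta}^{(t)}$ and $\mathbf{D}^{(t)}$, we simply follow Algorithm 1 to fill in
the model-independent details.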
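3.4.2. Logistic Regression

In this setting $\mathbf{y}^{(t)} \in \{-1, +1\}^{n_t}$,
$f(\mathbf{x}; \boldsymbol{\theta}) = 1 / \left(1 + e^{-\boldsymbol{\theta}^{\top}\mathbf{x}}\right)$,
and $\mathcal{L}$ is the log-loss function. To apply ELLA, we first use a single-task
learner for logistic regression (of which there are many free and robust implementations)
to compute the value of $\boldsymbol{\theta}^{(t)}$. $\mathbf{D}^{(t)}$ is then given as:

$$\mathbf{D}^{(t)} = \frac{1}{2n_t} \sum_{i=1}^{n_t} \sigma_i^{(t)}\left(1 - \sigma_i^{(t)}\right)\mathbf{x}_i^{(t)}\mathbf{x}_i^{(t)\top}, \qquad \sigma_i^{(t)} = \frac{1}{1 + e^{-\boldsymbol{\theta}^{(t)\top}\mathbf{x}_i^{(t)}}} .$$

Given these formulas for $\boldsymbol{\theta}^{(t)}$ and $\mathbf{D}^{(t)}$, we follow
Algorithm 1 to fill in the model-independent details.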
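
As a sanity check on the logistic-regression expression, the sketch below computes $\mathbf{D}^{(t)}$ from a parameter vector and compares it with half of a finite-difference Hessian of the average log-loss. The data, the parameter vector (an arbitrary point rather than a fitted optimum), and all function names are hypothetical; instances are stored as columns of `X`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_D(theta, X):
    """D = 1/(2 n_t) * sum_i sigma_i (1 - sigma_i) x_i x_i^T for X of shape (d, n_t)."""
    n_t = X.shape[1]
    sigma = sigmoid(X.T @ theta)
    weights = sigma * (1.0 - sigma)
    return (X * weights) @ X.T / (2.0 * n_t)

def mean_log_loss_gradient(theta, X, y):
    """Gradient of (1/n_t) * sum_i log(1 + exp(-y_i theta^T x_i)), labels in {-1, +1}."""
    margins = y * (X.T @ theta)
    return -(X @ (y * sigmoid(-margins))) / X.shape[1]

def numeric_hessian(theta, X, y, eps=1e-5):
    """Central-difference Hessian built from the analytic gradient."""
    d = theta.size
    H = np.zeros((d, d))
    for j in range(d):
        step = np.zeros(d)
        step[j] = eps
        H[:, j] = (mean_log_loss_gradient(theta + step, X, y)
                   - mean_log_loss_gradient(theta - step, X, y)) / (2.0 * eps)
    return H

# Hypothetical data; theta_demo is an arbitrary evaluation point.
rng = np.random.default_rng(2)
d, n_t = 3, 40
X = rng.normal(size=(d, n_t))
y = np.where(rng.normal(size=n_t) > 0, 1.0, -1.0)
theta_demo = rng.normal(size=d)

D_demo = logistic_D(theta_demo, X)
half_hessian = 0.5 * numeric_hessian(theta_demo, X, y)
```

The two matrices match to numerical precision, consistent with $\mathbf{D}^{(t)}$ being half the Hessian of the average loss.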


3.5. Connection to Dictionary Learning for Sparse Coding

ELLA is closely connected to the problem of learning a dictionary online for sparse coding
a set of input vectors. In fact, this problem is a special case of ELLA in which the
$\boldsymbol{\theta}^{(t)}$'s are given as input instead of learned from training data and
the $\mathbf{D}^{(t)}$'s are equal to the identity matrix. These simplifications yield the
following objective function:

$$\hat{g}_m(\mathbf{L}) = \lambda \left\|\mathbf{L}\right\|_F^2 + \frac{1}{T} \sum_{t=1}^{T} \left\|\boldsymbol{\theta}^{(t)} - \mathbf{L}\mathbf{s}^{(t)}\right\|_2^2 \qquad (9)$$

$$\mathbf{s}^{(t)} = \arg\min_{\mathbf{s}} \left\{ \mu \left\|\mathbf{s}\right\|_1 + \left\|\boldsymbol{\theta}^{(t)} - \mathbf{L}_m\mathbf{s}\right\|_2^2 \right\} .$$

Equation 9 is identical to the equation used for efficient online dictionary learning by
Mairal et al. (2009), with the one difference that we use a soft constraint on the
magnitude of the entries of $\mathbf{L}$ ($L_2$ regularization), whereas Mairal et al. use
a hard length constraint on each column of $\mathbf{L}$.


3.6. Convergence Guarantees

Here, we provide proof sketches for three results; complete proofs are available in the
Online Appendix. For simplicity of exposition, our analysis is performed in the setting
where ELLA receives training data for a new task at each iteration. Therefore, the number
of tasks learned $T$ is always equal to the iteration number $m$. Extending our analysis
to the more general case outlined in Algorithm 1 is straightforward. Our convergence proof
is closely modeled on the analysis by Mairal et al. (2009).

We define the expected cost of a particular $\mathbf{L}$ as

$$g(\mathbf{L}) = E_{\mathbf{D}^{(t)},\boldsymbol{\theta}^{(t)}}\left[ \min_{\mathbf{s}} \ell\left(\mathbf{L}, \mathbf{s}, \boldsymbol{\theta}^{(t)}, \mathbf{D}^{(t)}\right) \right] ,$$

where we use a subscript on the expectation operator to denote which part of the
expression is a random variable. The expected cost represents how well a particular set of
latent components can be used to represent a randomly selected task given that the
knowledge repository $\mathbf{L}$ is not modified.

We show three results on the convergence of ELLA, given respectively as Propositions 1-3:

1. The latent model component matrix, $\mathbf{L}_T$, becomes increasingly stable as $T$
increases.

2. The value of the surrogate cost function, $\hat{g}_T(\mathbf{L}_T)$, and the value of
the true empirical cost function, $g_T(\mathbf{L}_T)$, converge almost surely (a.s.) as
$T \to \infty$.

3. $\mathbf{L}_T$ converges asymptotically to a stationary point of the expected loss $g$.

These results are based on the following assumptions:

A. The tuples $\left(\mathbf{D}^{(t)}, \boldsymbol{\theta}^{(t)}\right)$ are drawn i.i.d.
from a distribution with compact support.

B. For all $\mathbf{L}$, $\mathbf{D}^{(t)}$, and $\boldsymbol{\theta}^{(t)}$, the smallest
eigenvalue of $\mathbf{L}_{\gamma}^{\top}\mathbf{D}^{(t)}\mathbf{L}_{\gamma}$ is at least
$\kappa$ (with $\kappa > 0$), where $\gamma$ is the set of non-zero indices of the vector
$\mathbf{s}^{(t)} = \arg\min_{\mathbf{s}} \left\|\boldsymbol{\theta}^{(t)} - \mathbf{L}\mathbf{s}\right\|_{\mathbf{D}^{(t)}}^2$.
The non-zero elements of the unique minimizing $\mathbf{s}^{(t)}$ are given by:
$\mathbf{s}_{\gamma}^{(t)} = \left(\mathbf{L}_{\gamma}^{\top}\mathbf{D}^{(t)}\mathbf{L}_{\gamma}\right)^{-1}\left(\mathbf{L}_{\gamma}^{\top}\mathbf{D}^{(t)}\boldsymbol{\theta}^{(t)} - \mu\boldsymbol{\epsilon}_{\gamma}\right)$,
where the vector $\boldsymbol{\epsilon}_{\gamma}$ contains the signs of the non-zero
entries of $\mathbf{s}^{(t)}$.

Proposition 1: $\mathbf{L}_{T+1} - \mathbf{L}_T = O\left(\frac{1}{T}\right)$.

Proof sketch: First, we show that the $L_2$ regularization term bounds the maximum
magnitude of each entry of $\mathbf{L}$, and that the $L_1$ regularization term bounds the
maximum magnitude of each entry of $\mathbf{s}^{(t)}$. Next, we show that
$\hat{g}_T - \hat{g}_{T-1}$ is Lipschitz with constant $O\left(\frac{1}{T}\right)$. This
result and the facts that $\mathbf{L}_{T-1}$ minimizes $\hat{g}_{T-1}$ and the
regularization term ensures that the minimum eigenvalue of the Hessian of $\hat{g}_{T-1}$
is lower bounded by $2\lambda$ allow us to complete the proof. ■


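Before stating our next proposition, we define:

$$\boldsymbol{\alpha}\left(\mathbf{L}, \boldsymbol{\theta}^{(t)}, \mathbf{D}^{(t)}\right) = \arg\min_{\mathbf{s}} \ell\left(\mathbf{L}, \mathbf{s}, \boldsymbol{\theta}^{(t)}, \mathbf{D}^{(t)}\right) , \qquad (10)$$

and introduce the following lemma:

Lemma 1:

A. $\min_{\mathbf{s}} \ell\left(\mathbf{L}, \mathbf{s}, \boldsymbol{\theta}^{(t)}, \mathbf{D}^{(t)}\right)$
is continuously differentiable in $\mathbf{L}$ with
$\nabla_{\mathbf{L}} \min_{\mathbf{s}} \ell\left(\mathbf{L}, \mathbf{s}, \boldsymbol{\theta}^{(t)}, \mathbf{D}^{(t)}\right) = -2\mathbf{D}^{(t)}\left(\boldsymbol{\theta}^{(t)} - \mathbf{L}\boldsymbol{\alpha}\left(\mathbf{L}, \boldsymbol{\theta}^{(t)}, \mathbf{D}^{(t)}\right)\right)\boldsymbol{\alpha}\left(\mathbf{L}, \boldsymbol{\theta}^{(t)}, \mathbf{D}^{(t)}\right)^{\top}$.

B. $g$ is continuously differentiable with
$\nabla g(\mathbf{L}) = 2\lambda\mathbf{I} + E_{\boldsymbol{\theta}^{(t)},\mathbf{D}^{(t)}}\left[\nabla_{\mathbf{L}} \min_{\mathbf{s}} \ell\left(\mathbf{L}, \mathbf{s}, \boldsymbol{\theta}^{(t)}, \mathbf{D}^{(t)}\right)\right]$.

C. $\nabla_{\mathbf{L}} g(\mathbf{L})$ is Lipschitz on the space of latent model
components $\mathbf{L}$.

Proof sketch: Part (A) can be easily shown using the fact that $\boldsymbol{\alpha}$ is
continuous and by applying a corollary of Theorem 4.1 as stated by Bonnans & Shapiro (1998)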
London School Data: The London Schools data set consists of examination scores from 15,362
students in 139 schools from the Inner London Education Authority. We treat the data from
each school as a separate task. The goal is to predict the examination score of each
student. We use the same feature encoding as used by Kumar & Daumé III (2012), where four
school-specific categorical variables along with three student-specific categorical
variables are encoded as a collection of binary features. In addition, we use the
examination year and a bias term as additional features, giving each data instance
$d = 27$ features.

4.2.  Evaluation  Procedure 

Each data set was evaluated using 10 randomly generated 50/50 splits of the data between a
training and hold-out test set. The particular splits were standardized across algorithms.
For the online algorithms (ELLA and OMTL), we evaluated 100 randomized presentation orders
of the tasks.

The parameter values of $k$ and $\lambda$ for ELLA and GO-MTL were selected independently
for each algorithm and data set using a grid search over values of $k$ from 1 to either 10
or […] (whichever was smaller) and values of $\lambda$ from the set
{$e^{-5}$, […], $e^{1}$, […]}. For ELLA, we selected the parameter values based on the
training data alone. For a given training/test split, we further subdivided each training
set into 10 random 50/50 sub-training/sub-validation sets and then chose parameter values
that maximized the average performance on each of these 10 sub-validation sets. For
GO-MTL, OMTL, and STL, the particular parameter values were selected to maximize test
performance on the hold-out data averaged across all 10 random splits. Note that choosing
parameter values for the other algorithms to maximize their test performance puts ELLA at
a disadvantage relative to those algorithms.
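
The selection procedure used for ELLA can be sketched as a small routine. This is a simplified illustration, not the authors' code: it splits a single pool of training indices rather than each task's training set separately, the scoring callback `score_fn` is a hypothetical placeholder for "train on the sub-training half, evaluate on the sub-validation half", and the grid values are made up for the example.

```python
import numpy as np

def select_parameters(n_train, candidates, score_fn, n_splits=10, seed=0):
    """Pick the candidate with the best mean score over random 50/50 sub-splits.

    score_fn(params, sub_train_idx, sub_valid_idx) must return a number where
    higher is better (for example AUC or negative rMSE).
    """
    rng = np.random.default_rng(seed)
    half = n_train // 2
    splits = []
    for _ in range(n_splits):
        order = rng.permutation(n_train)
        splits.append((order[:half], order[half:]))
    best_params, best_score = None, -np.inf
    for params in candidates:
        mean_score = float(np.mean([score_fn(params, tr, va) for tr, va in splits]))
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score

def toy_score(params, sub_train_idx, sub_valid_idx):
    """Hypothetical stand-in for real training and validation; peaks at k=3, lam=1."""
    return -abs(params["k"] - 3) - abs(np.log(params["lam"]))

# Hypothetical grid, not the grid used in the experiments.
grid = [{"k": k, "lam": lam}
        for k in range(1, 6)
        for lam in (np.exp(-5.0), 1.0, np.exp(1.0))]
best, best_score = select_parameters(40, grid, toy_score)
```

Only training indices enter the routine, which reflects the point made above: ELLA's parameters never see the hold-out test data.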

The two parameters (burn-in time and learning rate) for OMTL were optimized using a grid
search. The burn-in time was optimized over the set $\{50, 100, 150, \ldots, 400\}$ and
the learning rate was optimized over the set $\{e^{-30}, e^{-29}, \ldots, e^{0}\}$. We
report results using the LogDet update rule for OMTL, and found that the results did not
vary greatly when other rules were employed; see (Saha et al., 2011) for more details on
OMTL. For STL, the ridge term for either logistic or linear regression was selected by
performing a grid search over the set $\{e^{-5}, e^{-4}, \ldots, e^{5}\}$.

Each task was presented sequentially to ELLA and OMTL, following the lifelong learning
framework (Section 3.1). ELLA learned each new task from a single batch of data that
contained all training instances of that task. For OMTL, which learns one instance at a
time, we performed five passes over the training data for each task. We also tried using
more than five passes, but the OMTL model accuracy did not increase further. GO-MTL was
considered to have converged when either its objective function value decreased by less
than 10^-[…] or 2,000 iterations were executed.

We measured predictive performance on the classification problems using the area under the
ROC curve (AUC). This particular performance metric was chosen since both classification
data sets had highly biased class distributions, and therefore other metrics like
misclassification rate would only be informative for specific applications with
well-specified tradeoffs between true and false positives. For regression problems, the
performance was evaluated using the negative root mean-squared error (-rMSE) metric (with
-rMSE, higher numbers indicate better performance). Recall that OMTL does not support
regression, and so we do not evaluate it on the regression tasks.
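
For reference, both metrics can be computed in a few lines. The sketch below is a plain NumPy illustration with hypothetical function names; it computes AUC through its rank interpretation (the fraction of positive/negative pairs ordered correctly, with ties counted as one half) and assumes labels in $\{-1, +1\}$.

```python
import numpy as np

def negative_rmse(y_true, y_pred):
    """Negative root mean-squared error; values closer to zero are better."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return -np.sqrt(np.mean((y_true - y_pred) ** 2))

def auc(labels, scores):
    """Area under the ROC curve for labels in {-1, +1}.

    Computed as the probability that a random positive example receives a
    higher score than a random negative example (ties count one half).
    """
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels == 1]
    neg = scores[labels == -1]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("AUC needs at least one positive and one negative example")
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (pos.size * neg.size)

perfect = auc([1, 1, -1, -1], [0.9, 0.8, 0.2, 0.1])
mixed = auc([1, 1, -1, -1], [0.9, 0.2, 0.8, 0.1])
error = negative_rmse([0.0, 0.0], [3.0, 4.0])
```

The pairwise form is quadratic in the number of examples, which is acceptable for a sketch but would be replaced by a rank-based computation on large data sets. Because it depends only on the ordering of scores, AUC is unaffected by the class imbalance noted above.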

The computational cost of each algorithm was measured using wall-clock time on a Mac Pro
computer with 8GB RAM and two 6-core 2.67GHz Intel Xeon processors. We report the running
time for the batch GO-MTL algorithm, and the speedup that ELLA, STL, and OMTL obtain
relative to the batch algorithm both for learning all tasks and for learning each
consecutive new task. We optimized the implementations of all algorithms to ensure a fair
comparison.

4.3.  Results 

For classification problems, ELLA achieves nearly identical performance to GO-MTL
(Table 1) while registering speedups of at least 1,350 times for learning all tasks and
38,400 times for learning each new task (Table 2). In addition, OMTL, which is
specifically designed for learning efficiently online, achieved significantly worse
accuracy on land mine detection and moderately worse accuracy on facial expression
recognition. While OMTL did run much faster than GO-MTL, its speed did not match ELLA's.
STL was the fastest approach (ELLA is necessarily slower than STL since STL is used as a
subroutine inside ELLA), but had lower accuracy than ELLA.

We find similar results for the regression problems, with ELLA achieving nearly identical
accuracy to GO-MTL (within 1.1% for real data and 2.3% for synthetic data; see Table 1),
while achieving dramatically shorter learning times when learning all tasks (minimum
speedup of 2,721 times) and each new task (minimum speedup of 378,219 times). STL was
again the fastest of all approaches, but had lower accuracy.
(originally shown by Danskin (1967)). Part (B) follows directly since the tuple
$\left(\mathbf{D}^{(t)}, \boldsymbol{\theta}^{(t)}\right)$ is drawn from a distribution
with compact support. Part (C) of the lemma crucially relies on Assumption (B), which
ensures that the optimal sparse coding solution is unique. This fact, in combination with
some properties that the optimal sparse coding solution must satisfy, allows us to prove
that $\boldsymbol{\alpha}$ is Lipschitz, which implies that $\nabla_{\mathbf{L}} g$ is
Lipschitz due to the form of the gradient established in Parts (A) and (B). ■

Proposition 2:

A. $\hat{g}_T(\mathbf{L}_T)$ converges a.s.

B. $g_T(\mathbf{L}_T) - \hat{g}_T(\mathbf{L}_T)$ converges a.s. to 0.

C. $g_T(\mathbf{L}_T) - g(\mathbf{L}_T)$ converges a.s. to 0.

D. $g(\mathbf{L}_T)$ converges a.s.

Proof sketch: First, we show that the sum of the positive variations of the stochastic
process $u_T = \hat{g}_T(\mathbf{L}_T)$ is bounded by invoking a corollary of the Donsker
theorem ((Van der Vaart, 2000) Chapter 19.2, lemma 19.36, ex. 19.7). Given this result, we
apply a theorem from (Fisk, 1965) to show that $u_T$ is a quasi-martingale that converges
almost surely. The fact that $u_T$ is a quasi-martingale along with a simple theorem of
positive sequences allows us to prove part (B) of the proposition. The final two parts
(C & D) can be shown due to the equivalence of $g$ and $g_T$ as $T \to \infty$. ■

Proposition 3: The distance between $L_T$ and the set of $g$'s
stationary points converges a.s. to 0 as $T \to \infty$.

Proof sketch: We employ the fact that both the surrogate $\hat{g}_T$
and the expected cost $g$ have gradients that are Lipschitz with
constant independent of $T$. This fact, in combination with the fact
that $\hat{g}_T$ and $g$ converge a.s. as $T \to \infty$, completes
the proof. ■

4.  Evaluation 

We evaluate ELLA against three other approaches: (1) GO-MTL (Kumar &
Daumé III, 2012), a batch MTL algorithm, (2) a perceptron-based
approach to online multi-task learning (OMTL) (Saha et al., 2011), and
(3) independent single-task learning (STL). GO-MTL provides a
reasonable upper bound on the accuracy of the models learned by ELLA
(since it is a batch algorithm that optimizes all task models
simultaneously). We are chiefly interested in understanding the
tradeoff in accuracy between models learned with ELLA and GO-MTL, and
the computational cost of learning these models. The comparison to
OMTL allows us to understand how the performance of ELLA compares with
another approach designed to learn efficiently in the lifelong
learning setting.


4.1.  Data  Sets 

We tested each algorithm on four multi-task data sets: (1) synthetic
regression tasks, (2) land mine detection from radar images, (3)
identification of three different facial movements from photographs
of a subject, and (4) predicting student exam scores. Data sets (2)
and (4) are benchmark data sets for MTL. We introduce data set (3) as
an MTL problem for the first time.

Synthetic Regression Tasks: We created a set of $T_{max} = 100$ random
tasks with $d = 13$ features and $n_t = 100$ instances per task. The
task parameter vectors $\theta^{(t)}$ were generated as a linear
combination of $k = 6$ randomly generated latent components in
$\mathbb{R}^{12}$. The vectors $s^{(t)}$ had a sparsity level of 0.5
(i.e., half the latent components were used to construct each
$\theta^{(t)}$). The training data $X^{(t)}$ was generated from a
standard normal distribution. The training labels were given as
$y^{(t)} = X^{(t)\top} \theta^{(t)} + \epsilon$, where each element of
$\epsilon$ is independent univariate Gaussian noise. A bias term was
added as the 13th feature prior to learning.
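
The generation procedure described above can be sketched in a few lines
of standard-library Python. This is only an illustration under stated
assumptions: the function name `make_synthetic_tasks`, the standard
normal distribution used for the latent components and for the nonzero
coefficients, the noise scale `noise_std`, and the default value of `k`
are choices made here, not details confirmed by the text.

```python
import random


def make_synthetic_tasks(num_tasks=100, d=12, k=6, n=100, noise_std=1.0, seed=0):
    """Generate sparse-coded linear regression tasks (illustrative sketch)."""
    rng = random.Random(seed)
    # Shared latent basis L (d x k); standard normal entries are an assumption.
    L = [[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(d)]
    tasks = []
    for _ in range(num_tasks):
        # Sparsity level 0.5: exactly half of the k latent components are used.
        active = set(rng.sample(range(k), k // 2))
        s = [rng.gauss(0.0, 1.0) if j in active else 0.0 for j in range(k)]
        # Task parameters are a linear combination of the latent components.
        theta = [sum(L[i][j] * s[j] for j in range(k)) for i in range(d)]
        X, y = [], []
        for _ in range(n):
            x = [rng.gauss(0.0, 1.0) for _ in range(d)]
            label = sum(xi * ti for xi, ti in zip(x, theta))
            label += rng.gauss(0.0, noise_std)  # independent Gaussian noise
            X.append(x + [1.0])  # bias term appended as the final feature
            y.append(label)
        tasks.append({"X": X, "y": y, "theta": theta, "s": s})
    return tasks
```

With the defaults, each task holds 100 instances of 13 features (12
random features plus the bias), and each coefficient vector has three
nonzero entries out of six.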

Land Mine Detection: In the land mine data set (Xue et al., 2007), the
goal is to detect whether or not a land mine is present in an area
based on radar images. The input features are automatically extracted
from radar data and consist of four moment-based features, three
correlation-based features, one energy-ratio feature, one spatial
variance feature, and a bias term; see (Xue et al., 2007) for more
details. The data set consists of a total of 14,820 data instances
divided into 29 different geographical regions. We treat each
geographical region as a different task.

Facial Expression Recognition: This data set is from a recent facial
expression recognition challenge (Valstar et al., 2011). The goal is
to detect the presence or absence of three different facial action
units (#5: upper lid raiser, #10: upper lip raiser, and #12: lip
corner pull) from an image of a subject's face. We chose this
combination of action units to be a challenge, since two of the action
units involve the lower face, suggesting a high potential for
transfer, while the other is an upper face action unit, suggesting a
low potential for transfer. Each task involves recognizing one of the
three action units for one of seven subjects, yielding a total of 21
tasks, each with 450 to 999 images. To represent the images, we
utilized a Gabor pyramid with a frequency bandwidth of 0.7 octaves,
orientation bandwidth of 120 degrees, four orientations, 576
locations, and two spatial scales, yielding a total of 2,880 Gabor
features for each image. We reduced the raw Gabor outputs to 100
dimensions using PCA, and added a bias term to produce the input
features.

Table 1. The accuracy of ELLA, OMTL, and STL relative to batch
multi-task learning (GO-MTL), showing that ELLA achieves nearly equal
accuracy to batch MTL and better accuracy than OMTL. The N/A's
indicate that OMTL does not handle regression problems. The standard
deviation of a value is given after the ± symbol.


Dataset        Problem Type     Batch MTL Accuracy        ELLA Relative   OMTL Relative   STL Relative
                                                          Accuracy        Accuracy        Accuracy
Land Mine      Classification   0.7802 ± 0.013 (AUC)      99.73 ± 0.7%    82.2 ± 3.0%     97.97 ± 1.5%
Facial Expr.   Classification   0.6577 ± 0.021 (AUC)      99.37 ± 3.1%    97.58 ± 3.8%    97.34 ± 3.9%
Syn. Data      Regression       -1.084 ± 0.006 (-rMSE)    97.74 ± 2.7%    N/A             92.91 ± 1.5%
London Sch.    Regression       -10.10 ± 0.066 (-rMSE)    98.90 ± 1.5%    N/A             97.20 ± 0.4%
Table 2. The running time of ELLA, OMTL, and STL as compared to batch
multi-task learning (GO-MTL), showing that ELLA achieves three orders
of magnitude speedup in learning all tasks, and four-to-five orders of
magnitude speedup in learning each consecutive new task. The N/A's
indicate that OMTL does not handle regression. Speedup was measured
relative to the batch method using optimized implementations. The
standard deviation of a value is given after the ±.

Dataset        Batch Runtime   ELLA All Tasks   ELLA New Task      OMTL All Tasks   OMTL New Task    STL All Tasks      STL New Task
               (seconds)       (speedup)        (speedup)          (speedup)        (speedup)        (speedup)          (speedup)
Land Mine      231 ± 6.2       1,350 ± 58       39,150 ± 1,682     22 ± 0.88        638 ± 25         3,342 ± 409        96,918 ± 11,861
Facial Expr.   2,200 ± 92      1,828 ± 100      38,400 ± 2,100     948 ± 65         19,900 ± 1,360   8,511 ± 1,107      178,719 ± 23,239
Syn. Data      1,300 ± 141     5,026 ± 685      502,600 ± 68,500   N/A              N/A              156,489 ± 17,564   1.6E6 ± 1.8E5
London Sch.    715 ± 36        2,721 ± 225      378,219 ± 31,275   N/A              N/A              36,000 ± 4,800     5.0E6 ± 6.7E5

Recall that ELLA does not re-optimize the value of $s^{(t)}$ unless it
receives new training data for task $t$. Therefore, in each
experiment, the value of $s^{(t)}$ is set when the training data for
that task is presented and never readjusted. Although the values of
$s^{(t)}$ are not updated, it is still possible that previously
learned task models can benefit from training on subsequent tasks
through modifications to $L$.

To assess whether this phenomenon of reverse transfer occurred, we
computed the change in accuracy from when a task was first learned
until after all tasks had been learned. A positive change in accuracy
for a task indicates that reverse transfer did occur. Figure 2 shows
this change in accuracy as a function of position in the task
sequence, revealing that reverse transfer occurred reliably in all
data sets and that reverse transfer is most beneficial for tasks that
were learned early (when the total amount of training data seen was
low). Most importantly, these results show that, with few exceptions,
subsequent learning did not reduce the performance of models that were
learned early.
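
The reverse-transfer measurement described above reduces to a per-task
difference of two accuracy values. The snippet below is a minimal
sketch of that bookkeeping; the function name `reverse_transfer` and
the accuracy values are hypothetical and are not results from the
experiments.

```python
def reverse_transfer(acc_when_learned, acc_after_all):
    """Change in accuracy per task, ordered by position in the task sequence."""
    if len(acc_when_learned) != len(acc_after_all):
        raise ValueError("both lists must cover the same tasks")
    return [after - before
            for before, after in zip(acc_when_learned, acc_after_all)]


# Hypothetical accuracies, indexed by position in the task sequence.
initial = [0.70, 0.74, 0.78, 0.80]  # measured right after each task was learned
final = [0.76, 0.77, 0.79, 0.80]    # measured after all tasks were learned

deltas = reverse_transfer(initial, final)
# A positive entry means later tasks improved the earlier task's model.
improved = [position for position, delta in enumerate(deltas) if delta > 0]
```

In this made-up example the first three tasks benefit from reverse
transfer and the last task is unchanged.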

5.  Conclusion 

We have presented an efficient algorithm for lifelong learning (ELLA)
that provides nearly identical accuracy to batch MTL, while requiring
three orders of magnitude less runtime. Also, ELLA is more flexible,
faster, and achieves better accuracy than a competing method for
online MTL. We have shown that ELLA works well on synthetic data as
well as three multi-task problems. Additionally, we discussed ELLA's
connections to online dictionary learning for sparse coding, and
presented theoretical guarantees that illuminate the reasons for
ELLA's strong performance. Our future work will include extending ELLA
to settings besides linear and logistic models and automatically
adjusting the basis size $k$ as it learns more tasks.

Figure 2. The change in accuracy from when a task is first learned
until all tasks have been learned, as a function of position in the
task sequence (horizontal axis: Task Number). Plotted lines are the
best fitting exponential curve.

Abstract

The success of applying policy gradient reinforcement learning (RL) to
difficult control tasks hinges crucially on the ability to determine a
sensible initialization for the policy. Transfer learning methods
tackle this problem by reusing knowledge gleaned from solving other
related tasks. In the case of multiple task domains, these algorithms
require an inter-task mapping to facilitate knowledge transfer across
domains. However, there are currently no general methods to learn an
inter-task mapping without requiring either background knowledge that
is not typically present in RL settings, or an expensive analysis of
an exponential number of inter-task mappings in the size of the state
and action spaces.

This paper introduces an autonomous framework that uses unsupervised
manifold alignment to learn inter-task mappings and effectively
transfer samples between different task domains. Empirical results on
diverse dynamical systems, including an application to quadrotor
control, demonstrate its effectiveness for cross-domain transfer in
the context of policy gradient RL.

Introduction 

Policy gradient reinforcement learning (RL) algorithms have been
applied with considerable success to solve high-dimensional control
problems, such as those arising in robotic control and coordination
(Peters & Schaal 2008). These algorithms use gradient ascent to tune
the parameters of a policy to maximize its expected performance.
Unfortunately, this gradient ascent procedure is prone to becoming
trapped in local maxima, and thus it has been widely recognized that
initializing the policy in a sensible manner is crucial for achieving
optimal performance. For instance, one typical strategy is to
initialize the policy using human demonstrations (Peters & Schaal
2006), which may be infeasible when the task cannot be easily solved
by a human. This paper explores a different approach: rather than
initializing the policy at random (i.e., tabula rasa) or via human
demonstrations, we use transfer learning (TL) to initialize the policy
for a new target domain based on knowledge from one or more source
tasks.

In RL transfer, the source and target tasks may differ in their
formulations (Taylor & Stone 2009). In particular, when the source and
target tasks have different state and/or action spaces, an inter-task
mapping (Taylor et al. 2007a) that describes the relationship between
the two tasks is typically needed. This paper introduces a framework
for autonomously learning an inter-task mapping for cross-domain
transfer in policy gradient RL. First, we learn an inter-state mapping
(i.e., a mapping between states in two tasks) using unsupervised
manifold alignment. Manifold alignment provides a powerful and general
framework that can discover a shared latent representation to capture
intrinsic relations between different tasks, irrespective of their
dimensionality. The alignment also yields an implicit inter-action
mapping that is generated by mapping tracking states from the source
to the target. Given the mapping between task domains, source task
trajectories are then used to initialize a policy in the target task,
significantly improving the speed of subsequent learning over an
uninformed initialization.

This paper provides the following contributions. First, we introduce a
novel unsupervised method for learning inter-state mappings using
manifold alignment. Second, we show that the discovered subspace can
be used to initialize the target policy. Third, our empirical
validation conducted on four dissimilar and dynamically chaotic task
domains (e.g., controlling a three-link cart-pole and a quadrotor
aerial vehicle) shows that our approach can a) automatically learn an
inter-state mapping across MDPs from the same domain, b) automatically
learn an inter-state mapping across MDPs from very different domains,
and c) transfer informative initial policies to achieve higher initial
performance and reduce the time needed for convergence to near-optimal
behavior.

Related  Work 

Learning an inter-task mapping has been of major interest in the
transfer learning community because of its promise of autonomous
transfer between very different tasks (Taylor & Stone 2009). However,
the majority of existing work assumes that a) the source task and
target task are similar enough that no mapping is needed (Banerjee &
Stone 2007; Konidaris & Barto 2007), or b) an inter-task mapping is
provided to the agent (Taylor et al. 2007a; Torrey et al. 2008). The
main difference between these methods and this paper is that we are
interested in learning a mapping between tasks.

There has been some recent work on learning such mappings. For
example, mappings may be based on semantic knowledge about state
features between two tasks (Liu & Stone 2006), background knowledge
about the range or type of state variables (Taylor et al. 2007b), or
transition models for each possible mapping could be generated and
tested (Taylor et al. 2008). However, there are currently no general
methods to learn an inter-task mapping without requiring either
background knowledge that is not typically present in RL settings, or
an expensive analysis of an exponential number (in the size of the
action and state variable sets) of inter-task mappings. We overcome
these issues by automatically discovering high-level features and
using them to transfer knowledge between agents without suffering from
an exponential explosion.

In previous work, we used sparse coding, sparse projection, and sparse
Gaussian processes to learn an inter-task mapping between MDPs with
arbitrary variations (Bou Ammar et al. 2012). However, this previous
work relied on a Euclidean distance correlation between source and
target task triplets, which may fail for highly dissimilar tasks.
Additionally, it placed restrictions on the inter-task mapping that
reduced the flexibility of the learned mapping. In other related work,
Bócsi et al. (2013) use manifold alignment to assist in transfer. The
primary differences with our work are that the authors a) focus on
transferring models between different robots, rather than
policies/samples, and b) rely on source and target robots that are
qualitatively similar.


Background 

Reinforcement learning problems involve an agent choosing sequential
actions to maximize its expected return. Such problems are typically
formalized as a Markov decision process (MDP)
$\mathcal{T} = \langle S, A, \mathcal{P}_0, \mathcal{P}, r \rangle$,
where $S$ is the (potentially infinite) set of states, $A$ is the set
of actions that the agent may execute, $\mathcal{P}_0 : S \to [0,1]$
is a probability distribution over the initial state,
$\mathcal{P} : S \times A \times S \to [0,1]$ is a state transition
probability function describing the task dynamics, and
$r : S \times A \times S \to \mathbb{R}$ is the reward function
measuring the performance of the agent. A policy
$\pi : S \times A \to [0,1]$ is defined as a conditional probability
distribution over actions given the current state. The agent's goal is
to find a policy $\pi^*$ which maximizes the average expected reward:

$$\pi^* = \arg\max_{\pi} E\left[ \frac{1}{H} \sum_{t=1}^{H} r(s_t, a_t, s_{t+1}) \right] \qquad (1)$$

$$\phantom{\pi^*} = \arg\max_{\pi} \int_{\mathbb{T}} p_{\pi}(\tau)\, \mathcal{R}(\tau)\, d\tau , \qquad (2)$$

where $\mathbb{T}$ is the set of all possible trajectories with
horizon $H$,

$$\mathcal{R}(\tau) = \frac{1}{H} \sum_{t=1}^{H} r(s_t, a_t, s_{t+1}) , \text{ and}$$

$$p_{\pi}(\tau) = \mathcal{P}_0(s_1) \prod_{t=1}^{H} \mathcal{P}(s_{t+1} \mid s_t, a_t)\, \pi(a_t \mid s_t) . \qquad (3)$$

Policy gradient methods (Sutton et al. 1999; Peters et al. 2005)
represent the agent's policy $\pi$ as a function defined over a vector
$\theta \in \mathbb{R}^d$ of control parameters and a vector of state
features given by the transformation $\Phi : S \to \mathbb{R}^m$.

By substituting this parameterization of the control policy into
Eqn. (2), we can compute the parameters of the optimal policy as
$\theta^* = \arg\max_{\theta} J(\theta)$, where
$J(\theta) = \int_{\mathbb{T}} p_{\pi(\theta)}(\tau)\, \mathcal{R}(\tau)\, d\tau$.
To maximize $J$, many policy gradient methods employ standard
supervised function approximation to learn $\theta$ by following an
estimated gradient of a lower bound on the expected return of
$J(\theta)$.
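
To make the idea of following an estimated gradient of the expected
return concrete, the sketch below estimates the gradient of J for a
one-parameter Gaussian policy on a toy problem. It uses the plain
likelihood-ratio (REINFORCE-style) estimator rather than the
lower-bound approach mentioned above, and the scalar dynamics, the
reward, the initial-state distribution, and all function names are
assumptions made purely for illustration.

```python
import random


def rollout(theta, sigma, horizon, rng):
    """Simulate one trajectory; return (average reward, summed score)."""
    s = rng.gauss(0.0, 1.0)  # initial state from P_0 (assumed standard normal)
    total_reward = 0.0
    score = 0.0  # accumulates d/dtheta log pi(a_t | s_t)
    for _ in range(horizon):
        mean = theta * s
        a = rng.gauss(mean, sigma)            # a_t ~ N(theta * s_t, sigma^2)
        score += (a - mean) * s / sigma ** 2  # gradient of the log-density
        s = s + a                             # toy dynamics (assumed)
        total_reward += -s * s                # toy reward: stay near the origin
    return total_reward / horizon, score


def estimate_gradient(theta, sigma=0.5, horizon=10, num_rollouts=2000, seed=0):
    """Likelihood-ratio Monte Carlo estimate of dJ/dtheta."""
    rng = random.Random(seed)
    samples = [rollout(theta, sigma, horizon, rng) for _ in range(num_rollouts)]
    # Subtracting the mean return as a baseline reduces estimator variance.
    baseline = sum(r for r, _ in samples) / num_rollouts
    return sum((r - baseline) * g for r, g in samples) / num_rollouts
```

For this toy system the best gain is theta = -1 (it cancels the current
state), so the estimate should be negative at theta = 0 and positive at
theta = -2; gradient ascent then repeats
`theta += step_size * estimate_gradient(theta)`.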

Policy gradient algorithms have gained attention in the RL community
in part due to their successful applications on real-world robotics
(Peters et al. 2005). While such algorithms have a low computational
cost per update, high-dimensional problems require many updates (by
acquiring new rollouts) to achieve good performance. Transfer learning
can reduce this data requirement and accelerate learning.

Since policy gradient methods are prone to becoming stuck in local
maxima, it is crucial that the policy be initialized in a sensible
fashion. A common technique (Peters & Schaal 2006; Argall et al. 2009)
for policy initialization is to first collect demonstrations from a
human controlling the system, then use supervised learning to fit
policy parameters that maximize the likelihood of the
human-demonstrated actions, and finally use the fitted parameters as
the initial policy parameters for a policy gradient algorithm. While
this approach works well in some settings, it is inapplicable in
several common cases: a) when it is difficult to instrument the system
in question so that a human can successfully perform a demonstration,
b) when an agent is constantly faced with new tasks, making gathering
human demonstrations for each new task impractical, or c) when the
tasks in question cannot be intuitively solved by a human
demonstrator.
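
As a small illustration of the demonstration-based initialization
described above, consider a scalar policy a = theta * s + noise with
fixed-variance Gaussian noise. In that case, maximizing the likelihood
of the demonstrated actions is the same as a least-squares fit. The
function name and the demonstration data below are hypothetical.

```python
def fit_linear_policy(states, actions):
    """Least-squares estimate of theta for a scalar policy a = theta * s + noise."""
    denom = sum(s * s for s in states)
    if denom == 0.0:
        raise ValueError("demonstrations must include a nonzero state")
    return sum(s * a for s, a in zip(states, actions)) / denom


# Hypothetical demonstrations: the demonstrator pushes the state toward zero.
demo_states = [1.0, -2.0, 0.5, 3.0]
demo_actions = [-1.0, 2.0, -0.5, -3.0]

# The fitted value would serve as the starting point for policy gradient updates.
theta_init = fit_linear_policy(demo_states, demo_actions)
```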

The next section introduces a method for using transfer learning to
initialize the parameters of a policy in a way that is not susceptible
to these limitations. Our experimental results show that this method
of policy initialization, when compared to random policy
initialization, is able to not only achieve better initial
performance, but also obtain a higher performing policy when run until
convergence.

Policy  Gradient  Transfer  Learning 

Transfer learning aims to improve learning times and/or behavior of an
agent on a new target task $\mathcal{T}^{(T)}$ by reusing knowledge
from a solved source task $\mathcal{T}^{(S)}$. In RL settings, each
task is described by an MDP:
$\mathcal{T}^{(S)} = \langle S^{(S)}, A^{(S)}, \mathcal{P}_0^{(S)}, \mathcal{P}^{(S)}, r^{(S)} \rangle$
and
$\mathcal{T}^{(T)} = \langle S^{(T)}, A^{(T)}, \mathcal{P}_0^{(T)}, \mathcal{P}^{(T)}, r^{(T)} \rangle$.
One way in which knowledge from a solved source task can be leveraged
to solve the target task is by mapping optimal (state, action, next
state) triples from the source task into the state and action spaces
of the target task. Transferring optimal triples in this way allows us
to both provide a better jumpstart and learning ability to the target
agent, based on the source agent's ability.

While the preceding idea is attractive, complexities arise when the
source and target tasks have different state and/or action spaces. In
this case, one must define an inter-task mapping $\chi$ in order to
translate optimal triples from the source to the target task.
Typically (Taylor & Stone 2009), $\chi$ is defined by two
sub-mappings: (1) an inter-state mapping $\chi_S$ and (2) an
inter-action mapping $\chi_A$.

Figure 1. Transfer is split into two phases: (I) learning the
inter-state mapping via manifold alignment, and (II) initializing the
target policy via mapping the source task policy. (Diagram labels:
Source Domain; Target Domain; Phase II: Cross-domain transfer via
$\chi_S$.)


By adopting an RL framework where policies are state-feedback
controllers, we show that we can use optimal state trajectories from
the source task to intelligently initialize a control policy in the
target task, without needing to explicitly construct an inter-action
mapping. We accomplish this by learning a (pseudo-invertible)
inter-state mapping between the state spaces of a pair of tasks using
manifold alignment, which can then be used to transfer optimal
sequences of states to the target. The fact that our algorithm does
not require learning an explicit inter-action mapping significantly
reduces its computational complexity.

Our approach consists of two phases (Figure 1). First, using traces
gathered in the source and target tasks, we learn an inter-state
mapping $\chi_S$ using manifold alignment ("Phase I" in Figure 1). To
perform this step, we adapt the Unsupervised Manifold Alignment (UMA)
algorithm (Wang & Mahadevan 2009), as detailed in the next section.
Second, we use $\chi_S$ to project state trajectories from the source
to the target task ("Phase II" in Figure 1). These projected state
trajectories define a set of tracking trajectories for the target task
that allow us to perform one step of policy gradient improvement in
the target task. This policy improvement step intelligently
initializes the target policy, which results in better learning
performance than starting from a randomly initialized policy, as shown
in our experiments. Although we focus on policy gradient methods, our
approach could easily be adapted to other policy search methods (e.g.,
PoWER, REPS, etc.; see Kober et al. 2013).

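Learning an Inter-State Mapping

Unsupervised Manifold Alignment (UMA) is a technique that efficiently
discovers an alignment between two datasets (Wang & Mahadevan 2009).
UMA was developed to align datasets for knowledge transfer between two
supervised learning tasks. Here, we adapt UMA to an RL setting by
aligning source and target task state spaces with potentially
different dimensions $m_S$ and $m_T$. To learn $\chi_S$ relating
$S^{(S)}$ and $S^{(T)}$, trajectories of states in the source task,
$\tau_*^{(S)} = \{\{s_t^{(S),(i)}\}_{t=1}^{H_S}\}_{i=1}^{n_S}$, are
obtained by following $\pi_*^{(S)}$, and trajectories of states in the
target task,
$\tau^{(T)} = \{\{s_t^{(T),(j)}\}_{t=1}^{H_T}\}_{j=1}^{n_T}$, are
obtained by utilizing $\pi^{(T)}$, which is initialized using randomly
selected policy parameters. For simplicity of exposition, we assume
that trajectories in the source domain have length $H_S$ and those in
the target domain have length $H_T$; however, our algorithm is capable
of handling variable-length trajectories. We are interested in the
setting where data is scarcer in the target task than in the source
task (i.e., $n_T \ll n_S$).

Given trajectories from both the source and target tasks, we flatten
the trajectories (i.e., we treat the states as unordered) and then
apply the task-specific state transformation to obtain two sets of
state feature vectors, one for the source task and one for the target
task. Specifically, we create the following sets of points:

$$X^{(S)} = \left\{ \Phi^{(S)}\big(s_1^{(S),(1)}\big), \ldots, \Phi^{(S)}\big(s_{H_S}^{(S),(1)}\big), \ldots, \Phi^{(S)}\big(s_1^{(S),(n_S)}\big), \ldots, \Phi^{(S)}\big(s_{H_S}^{(S),(n_S)}\big) \right\}$$

$$X^{(T)} = \left\{ \Phi^{(T)}\big(s_1^{(T),(1)}\big), \ldots, \Phi^{(T)}\big(s_{H_T}^{(T),(1)}\big), \ldots, \Phi^{(T)}\big(s_1^{(T),(n_T)}\big), \ldots, \Phi^{(T)}\big(s_{H_T}^{(T),(n_T)}\big) \right\}$$

Given $X^{(S)} \in \mathbb{R}^{m_S \times (H_S \times n_S)}$ and
$X^{(T)} \in \mathbb{R}^{m_T \times (H_T \times n_T)}$, we can apply
the UMA algorithm (Wang & Mahadevan 2009) with minimal modification,
as described next.

Unsupervised Manifold Alignment (UMA): The first step of applying UMA
to learn the inter-state mapping is to represent each transformed
state in both the source and target tasks in terms of its local
geometry. We use the notation
$R_{x_i^{(S)}} \in \mathbb{R}^{(k+1) \times (k+1)}$ to refer to the
matrix of pairwise Euclidean distances among the $k$-nearest neighbors
of $x_i^{(S)} \in X^{(S)}$. Similarly, $R_{x_j^{(T)}}$ refers to the
equivalent matrix of distances for the $k$-nearest neighbors of
$x_j^{(T)} \in X^{(T)}$. The relations between local geometries in
$X^{(S)}$ and $X^{(T)}$ are represented by the matrix
$W \in \mathbb{R}^{(n_S \times H_S) \times (n_T \times H_T)}$ with
$w_{i,j} = \exp\left(-\mathrm{dist}\left(R_{x_i^{(S)}}, R_{x_j^{(T)}}\right)\right)$
and distance metric

$$\mathrm{dist}\left(R_{x_i^{(S)}}, R_{x_j^{(T)}}\right) = \min_{1 \le h \le k!} \min\left( \left\| \{R_{x_j^{(T)}}\}_h - \gamma_1 R_{x_i^{(S)}} \right\|_F ,\; \left\| R_{x_i^{(S)}} - \gamma_2 \{R_{x_j^{(T)}}\}_h \right\|_F \right) . \qquad (4)$$

We use the notation $\{\cdot\}_h$ to denote the $h$-th variant of the
$k!$ permutations of the rows and columns of the input matrix,
$\|\cdot\|_F$ is the Frobenius norm, and $\gamma_1$ and $\gamma_2$ are
defined as:

$$\gamma_1 = \frac{\mathrm{tr}\left(R_{x_i^{(S)}}^{\top} \{R_{x_j^{(T)}}\}_h\right)}{\mathrm{tr}\left(R_{x_i^{(S)}}^{\top} R_{x_i^{(S)}}\right)} , \qquad \gamma_2 = \frac{\mathrm{tr}\left(\{R_{x_j^{(T)}}\}_h^{\top} R_{x_i^{(S)}}\right)}{\mathrm{tr}\left(\{R_{x_j^{(T)}}\}_h^{\top} \{R_{x_j^{(T)}}\}_h\right)} .$$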
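
The local-geometry comparison can be sketched in standard-library
Python as follows. The sketch assumes that the distance of Eqn. (4)
takes the form used by Wang & Mahadevan (2009): the minimum, over
permutations of the k neighbors, of the two rescaled Frobenius
distances, with the scale factors given by ratios of traces. It also
assumes that the point itself stays in the first row and column, that
both neighborhoods use the same k, and that points are distinct. All
function names are hypothetical, and the brute-force loop over k!
permutations is only practical for small k.

```python
import itertools
import math


def local_geometry(points, index, k):
    """Pairwise Euclidean distances among points[index] and its k nearest neighbors."""
    center = points[index]
    others = sorted((j for j in range(len(points)) if j != index),
                    key=lambda j: math.dist(center, points[j]))
    hood = [index] + others[:k]
    return [[math.dist(points[a], points[b]) for b in hood] for a in hood]


def _trace_of_product(A, B):
    """tr(A^T B), which equals the sum of elementwise products."""
    return sum(a * b for row_a, row_b in zip(A, B) for a, b in zip(row_a, row_b))


def _frobenius(A, B, scale):
    """Frobenius norm of A - scale * B."""
    return math.sqrt(sum((a - scale * b) ** 2
                         for row_a, row_b in zip(A, B)
                         for a, b in zip(row_a, row_b)))


def geometry_distance(R_s, R_t):
    """Minimum rescaled Frobenius distance over neighbor permutations of R_t."""
    k = len(R_t) - 1
    best = math.inf
    for perm in itertools.permutations(range(1, k + 1)):
        order = (0,) + perm  # the center point stays in position 0 (assumption)
        R_h = [[R_t[a][b] for b in order] for a in order]
        gamma1 = _trace_of_product(R_s, R_h) / _trace_of_product(R_s, R_s)
        gamma2 = _trace_of_product(R_h, R_s) / _trace_of_product(R_h, R_h)
        best = min(best,
                   _frobenius(R_h, R_s, gamma1),
                   _frobenius(R_s, R_h, gamma2))
    return best


def alignment_weight(R_s, R_t):
    """Entry w_ij of W: exp(-dist) between two local geometries."""
    return math.exp(-geometry_distance(R_s, R_t))
```

Because of the rescaling by the two trace ratios, two neighborhoods
that differ only by a uniform scale (even across state spaces of
different dimension) have distance close to zero and weight close to
one.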
To align the manifolds, UMA computes the joint Laplacian


$$L = \begin{pmatrix} L_{X^{(S)}} + \mu \Gamma^{(1)} & -\mu \Gamma^{(2)} \\ -\mu \Gamma^{(3)} & L_{X^{(T)}} + \mu \Gamma^{(4)} \end{pmatrix} \qquad (5)$$

with diagonal matrices
$\Gamma^{(1)} \in \mathbb{R}^{(n_S \times H_S) \times (n_S \times H_S)}$
and
$\Gamma^{(4)} \in \mathbb{R}^{(n_T \times H_T) \times (n_T \times H_T)}$,
where $\Gamma^{(1)}_{i,i} = \sum_j w_{i,j}$ and
$\Gamma^{(4)}_{j,j} = \sum_i w_{i,j}$. The matrices
$\Gamma^{(2)} \in \mathbb{R}^{(n_S \times H_S) \times (n_T \times H_T)}$
and
$\Gamma^{(3)} \in \mathbb{R}^{(n_T \times H_T) \times (n_S \times H_S)}$
join the two manifolds with $\Gamma^{(2)}_{i,j} = w_{i,j}$ and
$\Gamma^{(3)}_{i,j} = w_{j,i}$. Additionally, the non-normalized
Laplacians $L_{X^{(S)}}$ and $L_{X^{(T)}}$ are defined as
$L_{X^{(S)}} = D_{X^{(S)}} - W_{X^{(S)}}$ and
$L_{X^{(T)}} = D_{X^{(T)}} - W_{X^{(T)}}$, where
$D_{X^{(S)}} \in \mathbb{R}^{(n_S \times H_S) \times (n_S \times H_S)}$
is a diagonal matrix with
$D_{X^{(S)}}^{(i,i)} = \sum_j w_{X^{(S)}}^{(i,j)}$ and, similarly,
$D_{X^{(T)}}^{(i,i)} = \sum_j w_{X^{(T)}}^{(i,j)}$. The matrices
$W_{X^{(S)}}$ and $W_{X^{(T)}}$ represent the similarity in the source
and target task state spaces respectively and can be computed similar
to $W$.

To join the manifolds, UMA first defines two matrices:

$$Z = \begin{pmatrix} X^{(S)} & 0 \\ 0 & X^{(T)} \end{pmatrix} , \qquad D = \begin{pmatrix} D_{X^{(S)}} & 0 \\ 0 & D_{X^{(T)}} \end{pmatrix} . \qquad (6)$$

Given $Z$ and $D$, UMA computes optimal projections to reduce the
dimensionality of the joint structure by taking the $d$ minimum
eigenvectors $\zeta_1, \ldots, \zeta_d$ of the generalized eigenvalue
decomposition $Z L Z^{\top} \zeta = \lambda Z D Z^{\top} \zeta$. The
optimal projections $\alpha^{(S)}$ and $\alpha^{(T)}$ are then given as
the first $d_S$ and $d_T$ rows of $[\zeta_1, \ldots, \zeta_d]$,
respectively.

Given the embedding discovered by UMA, we can then define the
inter-state mapping as:

$$\chi_S[\cdot] = \alpha^{(T)\top +} \alpha^{(S)\top} [\cdot] . \qquad (7)$$

The inverse of the inter-state mapping (to project target states to
the source task) can be determined by taking the pseudo-inverse of
Eqn. (7), yielding
$\chi_S^{+}[\cdot] = \alpha^{(S)\top +} \alpha^{(T)\top} [\cdot]$.

Intuitively, this approach aligns the important regions of the source
task's state space (sampled based on optimal source trajectories) with
the state space explored so far in the target task. Although actions
were ignored in constructing the manifolds, the aligned representation
implicitly captures local state transition dynamics within each task
(since the states came from trajectories), providing a mechanism to
transfer trajectories between tasks, as we describe next.

Policy Transfer and Improvement

Next, we discuss the procedure for initializing the target policy,
$\pi^{(T)}$. We consider a model-free setting in which the policy is
linear over a set of (potentially) non-linear state feature functions
modulated by Gaussian noise (where the magnitude of the noise balances
exploration and exploitation). Specifically, we can write the source
and target policies as:

$$\pi^{(S)}\big(s_t^{(S)}\big) = \Phi^{(S)}\big(s_t^{(S)}\big)^{\top} \theta^{(S)} + \epsilon_t^{(S)}$$

$$\pi^{(T)}\big(s_t^{(T)}\big) = \Phi^{(T)}\big(s_t^{(T)}\big)^{\top} \theta^{(T)} + \epsilon_t^{(T)} ,$$

where $\epsilon_t^{(S)} \sim \mathcal{N}\big(0, \Sigma^{(S)}\big)$ and
$\epsilon_t^{(T)} \sim \mathcal{N}\big(0, \Sigma^{(T)}\big)$.

To initialize $\pi^{(T)}$, we first sample $m$ initial target
trajectories $D^{(T)} = \{\tau_{(i)}^{(T)}\}_{i=1}^{m}$ from the
target task using a randomly initialized policy (these can be newly
sampled trajectories or simply the ones used to do the initial
manifold alignment step). Next, we map the set of initial states in
$D^{(T)}$ to the source task using $\chi_S^{+}$. We then run the
optimal source policy starting from each of these mapped initial
states to produce a set of $m$ optimal state trajectories in the
source task. Finally, the resulting state trajectories are mapped back
to the target task using $\chi_S$ to generate a set of reflected
state trajectories in the target task,
$\tilde{D}^{(T)} = \{\tilde{\tau}_{(i)}^{(T)}\}_{i=1}^{m}$. For
clarity, we assume that all trajectories are of length $H$; however,
this is not a fundamental limitation of our algorithm.

We define the following transfer cost function:

$$J_{\mathcal{T}^{(S)} \to \mathcal{T}^{(T)}}\big(\theta^{(T)}\big) = \sum_{i=1}^{m} p_{\theta^{(T)}}\big(\tau_{(i)}^{(T)}\big)\, \hat{\mathcal{R}}^{(T)}\big(\tau_{(i)}^{(T)}, \tilde{\tau}_{(i)}^{(T)}\big) \qquad (8)$$

where $\hat{\mathcal{R}}^{(T)}$ is a cost function that penalizes
deviations between the initial sampled trajectories in the target task
and the reflected optimal trajectories:

$$\hat{\mathcal{R}}^{(T)}\big(\tau_{(i)}^{(T)}, \tilde{\tau}_{(i)}^{(T)}\big) = \frac{1}{H} \sum_{t=1}^{H} \left\| \tilde{s}_t^{(T),(i)} - s_t^{(T),(i)} \right\|_2^2 . \qquad (9)$$

Minimizing Eqn. (8) is equivalent to attaining a target policy
parameterization $\theta^{(T)}$ such that $\pi^{(T)}$ follows the
reflected trajectories $\tilde{D}^{(T)}$. Further, Eqn. (8) is in
exactly the form required to apply standard off-the-shelf policy
gradient algorithms to minimize the transfer cost. The Manifold
Alignment Cross-Domain Transfer for Policy Gradients (MAXDT-PG)
framework is detailed¹ in Algorithm 1.
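
The reflection and tracking steps can be illustrated with a short
standard-library Python sketch. It assumes the inter-state mapping has
already been computed and is available as a plain matrix (a list of
rows), and that the tracking cost is the average squared Euclidean
deviation between a sampled trajectory and its reflected counterpart.
The matrix `chi_S`, the trajectories, and the function names are made
up for illustration.

```python
def apply_mapping(chi, state):
    """Apply a linear inter-state mapping (list of rows) to one state-feature vector."""
    return [sum(c * x for c, x in zip(row, state)) for row in chi]


def tracking_cost(sampled, reflected):
    """Average squared Euclidean deviation between two equal-length state trajectories."""
    if len(sampled) != len(reflected):
        raise ValueError("trajectories must have the same horizon")
    total = 0.0
    for s, s_ref in zip(sampled, reflected):
        total += sum((a - b) ** 2 for a, b in zip(s, s_ref))
    return total / len(sampled)


# Hypothetical 3x2 matrix standing in for the learned mapping chi_S,
# taking 2-D source state features to 3-D target state features.
chi_S = [[1.0, 0.0],
         [0.0, 1.0],
         [1.0, 1.0]]

# A made-up optimal source trajectory and its reflection into the target task.
source_traj = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
reflected = [apply_mapping(chi_S, s) for s in source_traj]

# A made-up trajectory sampled in the target task under a random policy.
sampled = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
cost = tracking_cost(sampled, reflected)
```

A policy gradient method would then adjust the target policy
parameters so that trajectories with a low tracking cost become more
likely.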

Special  Cases 

Our work can be seen as an extension of the simpler model-based case
with a linear-quadratic regulator (LQR) (Bemporad et al. 2002) policy,
which is derived and explained in the online appendix² accompanying
this paper. Although the assumptions made by the model-based case seem
restrictive, the analysis in the appendix covers a wide range of
applications. These, for example, include: a) the case in which a
dynamical model is provided beforehand, or b) the case in which
model-based RL algorithms are adopted (see Busoniu et al. 2010). In
the main paper, however, we consider the more general model-free case.

Experiments  and  Results 

To assess MAXDT-PG's performance, we conducted experiments on transfer both between tasks in the same domain and between tasks in different domains. We also studied the robustness of the learned mapping by varying the number of source and target samples used for transfer and measuring the resultant target task performance. In all cases we compared the performance of MAXDT-PG to standard policy gradient learners. Our results show that MAXDT-PG was able to: a) learn a valid inter-state mapping with relatively little data from the target task, and b) effectively transfer between tasks from either the same or different domains.

¹Lines 9-11 of Algorithm 1 require interaction with the target domain (or a simulator) for acquiring the optimal policy. Such an assumption is common to policy gradient methods, where at each iteration, data is gathered and used to iteratively improve the policy.

²The online appendix is available on the authors' websites.

Algorithm 1 Manifold Alignment Cross-Domain Transfer for Policy Gradients (MAXDT-PG)

Inputs: Source and target tasks $T^{(S)}$ and $T^{(T)}$, optimal source policy $\pi^{*(S)}$, # source and target traces $n_S$ and $n_T$, # nearest neighbors $k$, # target rollouts $z_T$, initial # of target states $m$.

Learn $\chi_S$:
 1: Sample $n_S$ optimal source traces and $n_T$ random target traces
 2: Using the modified UMA approach, learn $\alpha^{(S)}$ and $\alpha^{(T)}$ to produce $\chi_S$
Transfer & Initialize Policy:
 3: Collect $m$ initial target states drawn from the target's initial state distribution
 4: Project these $m$ states to the source by applying $\chi_S^{+}[\cdot]$
 5: Apply the optimal source policy on these projected states to collect source samples
 6: Project the samples into the target using $\chi_S[\cdot]$ to produce tracking target traces
 7: Compute tracking rewards using Eqn. (9)
 8: Use policy gradients to minimize Eqn. (8), yielding $\theta_{tr}^{(T)}$
Improve Policy:
 9: Start with $\theta_{tr}^{(T)}$ and sample $z_T$ target rollouts
10: Follow policy gradients (e.g., episodic REINFORCE) but using target rewards
11: Return optimal target policy parameters $\theta^{*(T)}$
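The "Transfer & Initialize Policy" steps of Algorithm 1 (projecting target start states to the source, rolling out the source policy, and reflecting the result back into the target) can be sketched as follows. This is only an illustration under strong assumptions: the learned mapping is taken to be a plain matrix `chi` from source states to target states, its pseudo-inverse stands in for the reverse projection, and `source_policy` and `source_step` are hypothetical callables for the optimal source policy and the source dynamics.

```python
import numpy as np

def transfer_traces(chi, target_starts, source_policy, source_step, horizon):
    """Reflect source behavior into the target state space (illustrative)."""
    chi_pinv = np.linalg.pinv(chi)            # target -> source projection
    tracking = []
    for s_target in target_starts:
        s = chi_pinv @ s_target               # projected source start state
        trace = [s]
        for _ in range(horizon):              # roll out the source policy
            s = source_step(s, source_policy(s))
            trace.append(s)
        # Map every source state of the trace back into the target space.
        tracking.append(np.array(trace) @ chi.T)
    return tracking

# Toy usage: a 2-D source system mapped into a 3-D target state space.
chi = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
policy = lambda s: -s[0]
step = lambda s, u: s + 0.1 * np.array([s[1], u])
traces = transfer_traces(chi, [np.array([1.0, 0.0, 1.0])], policy, step, horizon=5)
```

The resulting target-space traces play the role of the tracking trajectories that the initial target policy is then trained to follow.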

Dynamical  System  Domains 

We tested MAXDT-PG and standard policy gradient learning on four dynamical systems (Figure 2). On all systems, the reward function was based on two factors: a) penalizing states far from the goal state, and b) penalizing high forces (actions) to encourage smooth, low-energy movements.

Simple Mass Spring Damper (SM): The goal with the SM is to control the mass at a specified position with zero velocity. The system dynamics are described by two state variables that represent the mass position and velocity, and a single force $F$ that acts on the cart in the $x$ direction.
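For concreteness, a mass-spring-damper of this kind and a reward with the two penalty terms described above might be simulated as in the sketch below. The explicit Euler integrator, the parameter values, and the quadratic form of the reward (including `force_weight`) are assumptions made for illustration; the experimental settings are given in the paper's appendix.

```python
import numpy as np

def sm_step(state, force, mass=1.0, k=1.0, b=0.5, dt=0.01):
    # One explicit Euler step of  mass * x'' = force - k * x - b * x'.
    x, v = state
    acc = (force - k * x - b * v) / mass
    return np.array([x + dt * v, v + dt * acc])

def reward(state, force, goal=(0.0, 0.0), force_weight=0.1):
    # a) penalize distance from the goal state, b) penalize large forces.
    dist = np.sum((np.asarray(state, dtype=float) - np.asarray(goal)) ** 2)
    return -dist - force_weight * force ** 2

state = np.array([1.0, 0.0])
next_state = sm_step(state, force=0.0)
```

Varying `mass`, `k`, and `b` yields different tasks within the same domain, which is how same-domain task families are obtained.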

Cart Pole (CP): The goal is to swing up and then balance the pole vertically. The system dynamics are described via a four-dimensional state vector $(x, \dot{x}, \theta, \dot{\theta})$, representing the position and velocity of the cart, and the angle and angular velocity of the pole, respectively. The actions consist of a force that acts on the cart in the $x$ direction.


Figure 2: Dynamical systems used in the experiments, including (c) Three-Link Cart Pole and (d) Quadrotor.

Three-Link Cart Pole (3CP): The 3CP dynamics are described via an eight-dimensional state vector $(x, \dot{x}, \theta_1, \dot{\theta}_1, \theta_2, \dot{\theta}_2, \theta_3, \dot{\theta}_3)$, where $x$ and $\dot{x}$ describe the position and velocity of the cart and $\theta_j$ and $\dot{\theta}_j$ represent the angle and angular velocity of the $j$th link. The system is controlled by applying a force $F$ to the cart in the $x$ direction, with the goal of balancing the three poles upright.

Quadrotor (QR): The system dynamics were adopted from a simulator validated on real quadrotors (Bouabdallah 2007; Voos & Bou Ammar 2010), and are described via three angles and three angular velocities in the body frame (i.e., $e_{1B}$, $e_{2B}$, and $e_{3B}$). The actions consist of four rotor torques $\{F_1, F_2, F_3, F_4\}$. Each task corresponds to a different quadrotor configuration (e.g., different armature lengths, etc.), and the goal is to stabilize the different quadrotors.

Same-Domain  Transfer 

We first evaluate MAXDT-PG on same-domain transfer. Within each domain, we can obtain different tasks by varying the system parameters (e.g., for the SM system we varied mass $M$, spring constant $K$, and damping constant $b$) as well as the reward functions. We assessed the performance of using the transferred policy from MAXDT-PG versus standard policy gradients by measuring the average reward on the target task vs. the number of learning iterations in the target. We also examined the robustness of MAXDT-PG's performance based on the number of source and target samples used to learn $\chi_S$. Rewards were averaged over 500 traces collected from 150 initial states. Due to space constraints, we report same-domain transfer results here; details of the tasks and experimental procedure can be found in the appendix².

Figure 3 shows MAXDT-PG's performance using varying numbers of source and target samples to learn $\chi_S$. These results reveal that transfer-initialized policies outperform standard policy gradient initialization. Further, as the number of samples used to learn $\chi_S$ increases, so do both the initial and final performance in all domains. All initializations result in equal per-iteration computational cost. Therefore, MAXDT-PG both improves sample complexity and reduces wall-clock learning time.

Figure 3: Same-domain transfer results. All plots share the same legend and vertical axis label.


Figure 4: Cross-domain transfer results. Plots (a)-(c) depict target task performance, and share the same legend and axis labels.
Plot (d) shows the correlation between manifold alignment quality (Procrustes metric) and quality of the transferred knowledge.


Cross-Domain Transfer

Next, we consider the more difficult problem of cross-domain transfer. The experimental setup is identical to the same-domain case with the crucial difference that the state and/or action spaces were different for the source and the target task (since the tasks were from different domains). We tested three cross-domain transfer scenarios: simple mass to cart pole, cart pole to three-link cart pole, and cart pole to quadrotor. In each case, the source and target task have different numbers of state variables and system dynamics. Details of these experiments are available in the appendix².

Figure 4 shows the results of cross-domain transfer, demonstrating that MAXDT-PG can achieve successful transfer between different task domains. These results reinforce the conclusions of the same-domain transfer experiments, showing that a) transfer-initialized policies outperform standard policy gradients, even between different task domains, and b) initial and final performance improves as more samples are used to learn $\chi_S$.

We also examined the correlation between the quality of the manifold alignment, as assessed by the Procrustes metric (Goldberg & Ritov 2009), and the quality of the transferred knowledge, as measured by the distance between the transferred ($\theta_{tr}$) and the optimal ($\theta^*$) parameters (Figure 4(d)). On both measures, smaller values indicate better quality. Each data point represents a transfer scenario between two different tasks, from either SM, CP, or 3CP; we did not consider quadrotor tasks due to the required simulator time. Although we show that the Procrustes measure is positively correlated with transfer quality, we hesitate to recommend it as a predictive measure of transfer performance. In our approach, the cross-domain mapping is not guaranteed to be orthogonal, and therefore the Procrustes measure is not theoretically guaranteed to accurately measure the quality of the global embedding (i.e., Goldberg and Ritov's (2009) Corollary 1 is not guaranteed to hold), but the Procrustes measure still appears correlated with transfer quality in practice.
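To give a feel for what a Procrustes-style alignment score measures, the sketch below computes the residual left after the best orthogonal alignment of two centered point sets. This is the classical global orthogonal Procrustes problem solved via an SVD; it is offered only as an assumed stand-in and is not the local measure of Goldberg & Ritov (2009) used in the paper. The function name `procrustes_residual` is hypothetical.

```python
import numpy as np

def procrustes_residual(a, b):
    # Center both point sets, then find the orthogonal matrix minimizing
    # ||a @ rot - b||_F (orthogonal Procrustes, solved with one SVD).
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    u, _, vt = np.linalg.svd(a.T @ b)
    rot = u @ vt
    # Normalized squared residual: 0 means a perfect orthogonal alignment.
    return np.linalg.norm(a @ rot - b) ** 2 / np.linalg.norm(b) ** 2

rng = np.random.default_rng(0)
points = rng.standard_normal((20, 2))
angle = 0.7
rotation = np.array([[np.cos(angle), -np.sin(angle)],
                     [np.sin(angle), np.cos(angle)]])
rotated = points @ rotation                   # orthogonally related copy
stretched = points * np.array([1.0, 3.0])     # not orthogonally related
```

A purely rotated copy scores near zero, whereas a non-orthogonal distortion such as the anisotropic stretch leaves a clearly positive residual, which mirrors why a non-orthogonal cross-domain mapping weakens the theoretical meaning of the score.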

We can conclude that MAXDT-PG is capable of: a) automatically learning an inter-state mapping, and b) effectively transferring between different domain systems. Even when the source and target tasks are highly dissimilar (e.g., cart pole to quadrotor), MAXDT-PG is capable of successfully providing target policy initializations that outperform state-of-the-art policy gradient techniques.

Conclusion 

We introduced MAXDT-PG, a technique for autonomous transfer between policy gradient RL algorithms. MAXDT-PG employs unsupervised manifold alignment to learn an inter-state mapping, which is then used to transfer samples and initialize the target task policy. MAXDT-PG's performance was evaluated on four dynamical systems, demonstrating that MAXDT-PG is capable of improving both an agent's initial and final performance relative to using policy gradient algorithms without transfer, even across different domains.


Approved  for  Public  Release;  Distribution  Unlimited. 



Acknowledgements

This research was supported by ONR grant #N00014-11-1-0139, AFOSR grant #FA8750-14-1-0069, and NSF grant IIS-1149917. We thank the anonymous reviewers for their helpful feedback.

References 

Argall, B. D.; Chernova, S.; Veloso, M.; and Browning, B. 2009. A survey of robot learning from demonstration. Robotics and Autonomous Systems 57(5):469-483.

Banerjee, B., and Stone, P. 2007. General game learning using knowledge transfer. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, 672-677.

Bemporad, A.; Morari, M.; Dua, V.; and Pistikopoulos, E. 2002. The explicit linear quadratic regulator for constrained systems. Automatica 38(1):3-20.

Bocsi, B.; Csato, L.; and Peters, J. 2013. Alignment-based transfer learning for robot models. In Proceedings of the International Joint Conference on Neural Networks (IJCNN).

Bou Ammar, H.; Taylor, M.; Tuyls, K.; Driessens, K.; and Weiss, G. 2012. Reinforcement learning transfer via sparse coding. In Proceedings of the 11th Conference on Autonomous Agents and Multiagent Systems (AAMAS).

Bouabdallah, S. 2007. Design and Control of Quadrotors with Application to Autonomous Flying. Ph.D. Dissertation, École polytechnique fédérale de Lausanne.

Busoniu, L.; Babuska, R.; De Schutter, B.; and Ernst, D. 2010. Reinforcement Learning and Dynamic Programming Using Function Approximators. Boca Raton, Florida: CRC Press.

Goldberg, Y., and Ritov, Y. 2009. Local Procrustes for manifold embedding: a measure of embedding quality and embedding algorithms. Machine Learning 77(1):1-25.

Kober, J.; Bagnell, A.; and Peters, J. 2013. Reinforcement learning in robotics: a survey. International Journal of Robotics Research 32(11):1238-1274.

Konidaris, G., and Barto, A. 2007. Building portable options: Skill transfer in reinforcement learning. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, 895-900.

Liu, Y., and Stone, P. 2006. Value-function-based transfer for reinforcement learning using structure mapping. In Proceedings of the 21st National Conference on Artificial Intelligence, 415-20.

Peters, J., and Schaal, S. 2006. Policy gradient methods for robotics. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, 2219-2225.

Peters, J., and Schaal, S. 2008. Natural actor-critic. Neurocomputing 71(7-9):1180-1190.

Peters, J.; Vijayakumar, S.; and Schaal, S. 2005. Natural actor-critic. In Proceedings of the 16th European Conference on Machine Learning (ECML), 280-291. Springer.

Sutton, R. S.; McAllester, D. A.; Singh, S. P.; and Mansour, Y. 1999. Policy gradient methods for reinforcement learning with function approximation. Neural Information Processing Systems 1057-1063.

Taylor, M. E., and Stone, P. 2009. Transfer learning for reinforcement learning domains: a survey. Journal of Machine Learning Research 10:1633-1685.

Taylor, M. E.; Kuhlmann, G.; and Stone, P. 2008. Autonomous transfer for reinforcement learning. In Proceedings of the 7th International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS), 283-290.

Taylor, M. E.; Stone, P.; and Liu, Y. 2007. Transfer learning via inter-task mappings for temporal difference learning. Journal of Machine Learning Research 8(1):2125-2167.

Taylor, M. E.; Whiteson, S.; and Stone, P. 2007. Transfer via inter-task mappings in policy search reinforcement learning. In Proceedings of the 6th International Joint Conference on Autonomous Agents and Multiagent Systems.

Torrey, L.; Shavlik, J.; Walker, T.; and Maclin, R. 2008. Relational macros for transfer in reinforcement learning. In Blockeel, H.; Ramon, J.; Shavlik, J.; and Tadepalli, P., eds., Inductive Logic Programming, volume 4894 of Lecture Notes in Computer Science. Springer Berlin Heidelberg. 254-268.

Voos, H., and Bou Ammar, H. 2010. Nonlinear tracking and landing controller for quadrotor aerial robots. In Proceedings of the IEEE International Conference on Control Applications (CCA), 2136-2141.

Wang, C., and Mahadevan, S. 2009. Manifold alignment without correspondence. In Proceedings of the 21st International Joint Conference on Artificial Intelligence (IJCAI), 1273-1278. Morgan Kaufmann.




Appendix C: Bou Ammar et al., 2015b


Safe  Policy  Search  for  Lifelong  Reinforcement  Learning  with  Sublinear  Regret 


Haitham Bou Ammar HAITHAMB@SEAS.UPENN.EDU

Rasul Tutunov TUTUNOV@SEAS.UPENN.EDU

Eric Eaton EEATON@CIS.UPENN.EDU

University  of  Pennsylvania,  Computer  and  Information  Science  Department,  Philadelphia,  PA  19104  USA 


Abstract 

Lifelong reinforcement learning provides a promising framework for developing versatile agents that can accumulate knowledge over a lifetime of experience and rapidly learn new tasks by building upon prior knowledge. However, current lifelong learning methods exhibit non-vanishing regret as the amount of experience increases, and include limitations that can lead to suboptimal or unsafe control policies. To address these issues, we develop a lifelong policy gradient learner that operates in an adversarial setting to learn multiple tasks online while enforcing safety constraints on the learned policies. We demonstrate, for the first time, sublinear regret for lifelong policy search, and validate our algorithm on several benchmark dynamical systems and an application to quadrotor control.

1.  Introduction 

Reinforcement learning (RL) (Busoniu et al., 2010; Sutton & Barto, 1998) often requires substantial experience before achieving acceptable performance on individual control problems. One major contributor to this issue is the tabula-rasa assumption of typical RL methods, which learn from scratch on each new task. In these settings, learning performance is directly correlated with the quality of the acquired samples. Unfortunately, the amount of experience necessary for high-quality performance increases exponentially with the tasks' degrees of freedom, inhibiting the application of RL to high-dimensional control problems.

When data is in limited supply, transfer learning can significantly improve model performance on new tasks by reusing previously learned knowledge during training (Taylor & Stone, 2009; Gheshlaghi Azar et al., 2013; Lazaric, 2011; Ferrante et al., 2008; Bou Ammar et al., 2012). Multi-task learning (MTL) explores another notion of knowledge transfer, in which task models are trained simultaneously and share knowledge during the joint learning process (Wilson et al., 2007; Zhang et al., 2008).

Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 2015. JMLR: W&CP volume 37. Copyright 2015 by the author(s).

In the lifelong learning setting (Thrun & O'Sullivan, 1996a;b), which can be framed as an online MTL problem, agents acquire knowledge incrementally by learning multiple tasks consecutively over their lifetime. Recently, based on the work of Ruvolo & Eaton (2013) on supervised lifelong learning, Bou Ammar et al. (2014) developed a lifelong learner for policy gradient RL. To ensure efficient learning over consecutive tasks, these works employ a second-order Taylor expansion around the parameters that are (locally) optimal for each task without transfer. This assumption simplifies the MTL objective into a weighted quadratic form for online learning, but since it is based on single-task learning, this technique can lead to parameters far from globally optimal. Consequently, the success of these methods for RL highly depends on the policy initializations, which must lead to near-optimal trajectories for meaningful updates. Also, since their objective functions average loss over all tasks, these methods exhibit non-vanishing regrets of the form $\mathcal{O}(R)$, where $R$ is the total number of rounds in a non-adversarial setting.

In addition, these methods may produce control policies with unsafe behavior (i.e., capable of causing damage to the agent or environment, catastrophic failure, etc.). This is a critical issue in robotic control, where unsafe control policies can lead to physical damage or user injury. This problem is caused by using constraint-free optimization over the shared knowledge during the transfer process, which may lead to uninformative or unbounded policies.

In this paper, we address these issues by proposing the first safe lifelong learner for policy gradient RL operating in an adversarial framework. Our approach rapidly learns high-performance safe control policies based on the agent's previously learned knowledge and safety constraints on each task, accumulating knowledge over multiple consecutive tasks to optimize overall performance. We theoretically analyze the regret exhibited by our algorithm, showing sublinear dependency of the form $\mathcal{O}(\sqrt{R})$ for $R$ rounds, thus outperforming current methods. We then evaluate our approach empirically on a set of dynamical systems.

2. Background

2.1.  Reinforcement  Learning 

An RL agent sequentially chooses actions to minimize its expected cost. Such problems are formalized as Markov decision processes (MDPs) $\langle \mathcal{X}, \mathcal{U}, \mathcal{P}, c, \gamma \rangle$, where $\mathcal{X} \subset \mathbb{R}^d$ is the (potentially infinite) state space, $\mathcal{U} \subset \mathbb{R}^{d_a}$ is the set of all possible actions, $\mathcal{P} : \mathcal{X} \times \mathcal{U} \times \mathcal{X} \to [0, 1]$ is a state transition probability describing the system's dynamics, $c : \mathcal{X} \times \mathcal{U} \times \mathcal{X} \to \mathbb{R}$ is the cost function measuring the agent's performance, and $\gamma \in [0, 1]$ is a discount factor. At each time step $m$, the agent is in state $x_m \in \mathcal{X}$ and must choose an action $u_m \in \mathcal{U}$, transitioning it to a new state $x_{m+1} \sim \mathcal{P}(x_{m+1} \mid x_m, u_m)$ and yielding a cost $c_{m+1} = c(x_{m+1}, u_m, x_m)$. The sequence of state-action pairs forms a trajectory $\tau$ over a (possibly infinite) horizon $M$. A policy $\pi : \mathcal{X} \times \mathcal{U} \to [0, 1]$ specifies a probability distribution over state-action pairs, where $\pi(u \mid x)$ represents the probability of selecting an action $u$ in state $x$. The goal of RL is to find an optimal policy $\pi^*$ that minimizes the total expected cost.

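Policy search methods have shown success in solving high-dimensional problems, such as robotic control (Kober & Peters, 2011; Peters & Schaal, 2008a; Sutton et al., 2000). These methods represent the policy $\pi_\alpha(u \mid x)$ using a vector $\alpha \in \mathbb{R}^d$ of control parameters. The optimal policy $\pi^*$ is found by determining the parameters $\alpha^*$ that minimize the expected average cost:

$$\min_{\alpha} \; l(\alpha) = \frac{1}{n} \sum_{k=1}^{n} p_\alpha\big(\tau^{(k)}\big)\, C\big(\tau^{(k)}\big), \tag{1}$$

where $n$ is the total number of trajectories, and $p_\alpha\big(\tau^{(k)}\big)$ and $C\big(\tau^{(k)}\big)$ are the probability and cost of trajectory $\tau^{(k)}$:

$$p_\alpha\big(\tau^{(k)}\big) = \mathcal{P}_0\big(x_0^{(k)}\big) \prod_{m=0}^{M-1} \mathcal{P}\big(x_{m+1}^{(k)} \mid x_m^{(k)}, u_m^{(k)}\big)\, \pi_\alpha\big(u_m^{(k)} \mid x_m^{(k)}\big), \tag{2}$$

$$C\big(\tau^{(k)}\big) = \frac{1}{M} \sum_{m=0}^{M-1} c\big(x_{m+1}^{(k)}, u_m^{(k)}, x_m^{(k)}\big), \tag{3}$$

with an initial state distribution $\mathcal{P}_0 : \mathcal{X} \to [0, 1]$. We handle a constrained version of policy search, in which optimality not only corresponds to minimizing the total expected cost, but also to ensuring that the policy satisfies safety constraints. These constraints vary between applications, for example corresponding to maximum joint torque or prohibited physical positions.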
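The two per-trajectory quantities that enter policy search, the average cost of a trajectory and the policy-dependent part of its probability, can be computed as in the sketch below. It assumes a scalar-action Gaussian policy whose mean is `alpha @ x` (raw states as features) and works with the log of the policy factor only, since the dynamics and initial-state terms do not depend on the policy parameters. All function names are hypothetical.

```python
import numpy as np

def trajectory_cost(states, actions, cost_fn):
    # Average of c(x_{m+1}, u_m, x_m) over the M steps of one trajectory.
    m_steps = len(actions)
    total = sum(cost_fn(states[m + 1], actions[m], states[m])
                for m in range(m_steps))
    return total / m_steps

def policy_log_likelihood(states, actions, alpha, sigma):
    # Sum over steps of log pi_alpha(u_m | x_m) for a Gaussian policy
    # with mean alpha^T x_m and standard deviation sigma (assumed form).
    means = np.asarray(states[:-1], dtype=float) @ alpha
    resid = np.asarray(actions, dtype=float) - means
    return float(np.sum(-0.5 * (resid / sigma) ** 2
                        - np.log(sigma * np.sqrt(2.0 * np.pi))))

states = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
actions = [1.0, 1.0]
cost = trajectory_cost(states, actions, lambda xn, u, x: xn[0] ** 2 + u ** 2)
loglik = policy_log_likelihood(states, actions, alpha=np.zeros(2), sigma=1.0)
```

Only the log-likelihood term depends on `alpha`, which is why policy gradient methods can differentiate the objective without a model of the transition dynamics.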

2.2.  Online  Learning  &  Regret  Analysis 

In this paper, we employ a special form of regret minimization games, which we briefly review here. A regret minimization game is a triple $\langle \mathcal{K}, \mathcal{F}, R \rangle$, where $\mathcal{K}$ is a non-empty decision set, $\mathcal{F}$ is the set of moves of the adversary which contains bounded convex functions from $\mathbb{R}^n$ to $\mathbb{R}$, and $R$ is the total number of rounds. The game proceeds in rounds, where at each round $j = 1, \dots, R$, the agent chooses a prediction $\theta_j \in \mathcal{K}$ and the environment (i.e., the adversary) chooses a loss function $l_j \in \mathcal{F}$. At the end of the round, the loss function $l_j$ is revealed to the agent and the decision $\theta_j$ is revealed to the environment. In this paper, we handle the full-information case, where the agent may observe the entire loss function $l_j$ as its feedback and can exploit this in making decisions. The goal is to minimize the cumulative regret $\sum_{j=1}^{R} l_j(\theta_j) - \inf_{u \in \mathcal{K}} \left[ \sum_{j=1}^{R} l_j(u) \right]$. When analyzing the regret of our methods, we use a variant of this definition to handle the lifelong RL case:

$$\mathfrak{R}_R = \sum_{j=1}^{R} l_{t_j}(\theta_j) - \inf_{u \in \mathcal{K}} \left[ \sum_{j=1}^{R} l_{t_j}(u) \right],$$

where $l_{t_j}(\cdot)$ denotes the loss of task $t$ at round $j$.

For our framework, we adopt a variant of regret minimization called "Follow the Regularized Leader," which minimizes regret in two steps. First, the unconstrained solution $\tilde{\theta}$ is determined (see Sect. 4.1) by solving an unconstrained optimization over the accumulated losses observed so far. Given $\tilde{\theta}$, the constrained solution is then determined by learning a projection into the constraint set via Bregman projections (see Abbasi-Yadkori et al. (2013)).
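The two-step structure can be illustrated on the simplest possible instance. The sketch assumes linear losses (summarized by their accumulated gradient), a squared Euclidean regularizer with weight `mu`, and a ball-shaped constraint set of a given `radius`; under those assumptions the unconstrained minimizer is available in closed form and the Bregman projection reduces to an ordinary Euclidean projection. None of these choices are taken from the paper.

```python
import numpy as np

def ftrl_round(grad_sum, mu, radius):
    # Step 1: unconstrained minimizer of <grad_sum, theta> + (mu / 2) * ||theta||^2.
    theta_unc = -grad_sum / mu
    # Step 2: project onto the constraint set {theta : ||theta|| <= radius}.
    norm = np.linalg.norm(theta_unc)
    if norm <= radius:
        theta_con = theta_unc
    else:
        theta_con = theta_unc * (radius / norm)
    return theta_unc, theta_con

theta_unc, theta_con = ftrl_round(np.array([3.0, 4.0]), mu=1.0, radius=1.0)
```

Here the unconstrained solution leaves the feasible ball, and the projection pulls it back to the boundary while keeping its direction.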

3.  Safe  Lifelong  Policy  Search 

We adopt a lifelong learning framework in which the agent learns multiple RL tasks consecutively, providing it the opportunity to transfer knowledge between tasks to improve learning. Let $\mathcal{T}$ denote the set of tasks, each element of which is an MDP. At any time, the learner may face any previously seen task, and so must strive to maximize its performance across all tasks. The goal is to learn optimal policies $\pi^*_{\alpha^*_1}, \dots, \pi^*_{\alpha^*_{|\mathcal{T}|}}$ for all tasks, where policy $\pi^*_{\alpha^*_t}$ for task $t$ is parameterized by $\alpha^*_t \in \mathbb{R}^d$. In addition, each task is equipped with safety constraints to ensure acceptable policy behavior: $A_t \alpha_t \le b_t$, with $A_t \in \mathbb{R}^{d \times d}$ and $b_t \in \mathbb{R}^d$ representing the allowed policy combinations. The precise form of these constraints depends on the application domain, but this formulation supports constraints on (e.g.) joint torque, acceleration, position, etc.

At each round $j$, the learner observes a set of $n_{t_j}$ trajectories $\big\{\tau_{t_j}^{(1)}, \dots, \tau_{t_j}^{(n_{t_j})}\big\}$ from a task $t_j \in \mathcal{T}$, where each trajectory has length $M_{t_j}$. To support knowledge transfer between tasks, we assume that each task's policy parameters $\alpha_{t_j} \in \mathbb{R}^d$ at round $j$ can be written as a linear combination of a shared latent basis $L \in \mathbb{R}^{d \times k}$ with coefficient vectors $s_{t_j} \in \mathbb{R}^k$; therefore, $\alpha_{t_j} = L s_{t_j}$. Each column of $L$ represents a chunk of transferable knowledge; this task construction has been used successfully in previous multi-task learning work (Kumar & Daumé III, 2012; Ruvolo & Eaton, 2013; Bou Ammar et al., 2014). Extending this previous work, we ensure that the shared knowledge repository is "informative" by incorporating bounding constraints on the Frobenius norm $\|\cdot\|_F$ of $L$. Consequently, the optimization problem after observing $r$ rounds is:

$$\min_{L, S} \; \sum_{j=1}^{r} \big[ \eta_{t_j}\, l_{t_j}\big(L s_{t_j}\big) \big] + \mu_1 \|S\|_F^2 + \mu_2 \|L\|_F^2 \tag{4}$$

$$\text{s.t.} \quad A_{t_j} \alpha_{t_j} \le b_{t_j} \quad \forall t_j \in \mathcal{I}_r,$$

$$\lambda_{\min}\big(L L^{\mathsf T}\big) \ge p \quad \text{and} \quad \lambda_{\max}\big(L L^{\mathsf T}\big) \le q,$$

where $p$ and $q$ are the constraints on $\|L\|_F$, $\eta_{t_j} \in \mathbb{R}$ are design weighting parameters¹, $\mathcal{I}_r = \{t_1, \dots, t_r\}$ denotes the set of all tasks observed so far through round $r$, and $S$ is the collection of all coefficients.
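A small numerical sketch of this factored representation: every task's parameter vector is a combination of the columns of a shared basis, the regularizer penalizes the Frobenius norms of both factors, and each task carries linear safety constraints. The dimensions, random values, and helper names (`regularizer`, `is_safe`) below are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, num_tasks = 6, 2, 4            # parameter size, basis size, number of tasks
L = rng.standard_normal((d, k))      # shared latent basis
S = rng.standard_normal((k, num_tasks))   # one coefficient column per task
alphas = L @ S                       # column t holds alpha_t = L s_t

def regularizer(L, S, mu1, mu2):
    # mu1 * ||S||_F^2 + mu2 * ||L||_F^2, the penalty terms of the objective.
    return mu1 * np.sum(S ** 2) + mu2 * np.sum(L ** 2)

def is_safe(A, b, alpha):
    # Linear safety constraint A alpha <= b, checked element-wise.
    return bool(np.all(A @ alpha <= b))
```

Because all task parameters live in the span of the same `k` columns, what is learned for one task immediately reshapes the representation available to the others.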


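The loss function $l_{t_j}(\alpha_{t_j})$ in Eq. (4) corresponds to a policy gradient learner for task $t_j$, as defined in Eq. (1). Typical policy gradient methods (Kober & Peters, 2011; Sutton et al., 2000) maximize a lower bound of the expected cost $l_{t_j}(\alpha_{t_j})$, which can be derived by taking the logarithm and applying Jensen's inequality:

$$\log\big[l_{t_j}(\alpha_{t_j})\big] = \log\left[ \sum_{k=1}^{n_{t_j}} p_{\alpha_{t_j}}^{(t_j)}\big(\tau_{t_j}^{(k)}\big)\, C^{(t_j)}\big(\tau_{t_j}^{(k)}\big) \right] \tag{5}$$

$$\ge \log\big[n_{t_j}\big] + \mathbb{E}\left[ \sum_{m=0}^{M_{t_j}-1} \log\Big[\pi_{\alpha_{t_j}}\big(u_m^{(k,t_j)} \mid x_m^{(k,t_j)}\big)\Big] \right]_{k=1}^{n_{t_j}} + \text{const}.$$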


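Therefore, our goal is to minimize the following objective:

$$\min_{L, S} \; \sum_{j=1}^{r} \left[ -\frac{\eta_{t_j}}{n_{t_j}} \sum_{k=1}^{n_{t_j}} \sum_{m=0}^{M_{t_j}-1} \log\Big[\pi_{\alpha_{t_j}}\big(u_m^{(k,t_j)} \mid x_m^{(k,t_j)}\big)\Big] \right] + \mu_1 \|S\|_F^2 + \mu_2 \|L\|_F^2 \tag{6}$$

$$\text{s.t.} \quad A_{t_j} \alpha_{t_j} \le b_{t_j} \quad \forall t_j \in \mathcal{I}_r,$$

$$\lambda_{\min}\big(L L^{\mathsf T}\big) \ge p \quad \text{and} \quad \lambda_{\max}\big(L L^{\mathsf T}\big) \le q.$$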


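3.1. Online Formulation

The optimization problem above can be mapped to the standard online learning framework by unrolling $L$ and $S$ into a vector $\theta = [\mathrm{vec}(L) \;\; \mathrm{vec}(S)]^{\mathsf T} \in \mathbb{R}^{dk + k|\mathcal{T}|}$. Choosing $\Omega_0(\theta) = \mu_2 \sum_{i=1}^{dk} \theta_i^2 + \mu_1 \sum_{i=dk+1}^{dk+k|\mathcal{T}|} \theta_i^2$, and $\Omega_j(\theta) = \Omega_{j-1}(\theta) + \eta_{t_j} l_{t_j}(\theta)$, we can write the safe lifelong policy search problem (Eq. (6)) as:

$$\theta_{r+1} = \arg\min_{\theta \in \mathcal{K}} \Omega_r(\theta), \tag{7}$$

where $\mathcal{K} \subseteq \mathbb{R}^{dk + k|\mathcal{T}|}$ is the set of allowable policies under the given safety constraints. Note that the loss for task $t_j$ can be written as a bilinear product in $\theta$:

$$l_{t_j}(\theta) = -\frac{1}{n_{t_j}} \sum_{k=1}^{n_{t_j}} \sum_{m=0}^{M_{t_j}-1} \log\Big[\pi^{(t_j)}_{\Theta_L \Theta_{s_{t_j}}}\big(u_m^{(k,t_j)} \mid x_m^{(k,t_j)}\big)\Big],$$

$$\Theta_L = \begin{bmatrix} \theta_1 & \cdots & \theta_{d(k-1)+1} \\ \vdots & \ddots & \vdots \\ \theta_d & \cdots & \theta_{dk} \end{bmatrix}, \qquad \Theta_{s_{t_j}} = \begin{bmatrix} \theta_{dk+1} \\ \vdots \\ \theta_{(d+1)k+1} \end{bmatrix}.$$

We see that the problem in Eq. (7) is equivalent to Eq. (6) by noting that at $r$ rounds, $\Omega_r(\theta) = \sum_{j=1}^{r} \eta_{t_j} l_{t_j}(\theta) + \Omega_0(\theta)$.

¹We describe later in Sect. 5 how to set the $\eta$'s to obtain regret bounds, and leave them as variables now for generality.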
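The unrolling of the two factors into one parameter vector, and the regularizer evaluated on that vector, can be sketched as below. The sketch assumes the usual column-major convention for vec(), so that the first $d$ entries of the vector form the first column of the basis; the helper names `unroll`, `roll`, and `omega0` are hypothetical.

```python
import numpy as np

def unroll(L, S):
    # theta = [vec(L), vec(S)], stacking columns (column-major order).
    return np.concatenate([L.ravel(order="F"), S.ravel(order="F")])

def roll(theta, d, k, num_tasks):
    # Inverse of unroll: recover the d x k basis and k x num_tasks coefficients.
    L = theta[:d * k].reshape((d, k), order="F")
    S = theta[d * k:].reshape((k, num_tasks), order="F")
    return L, S

def omega0(theta, d, k, mu1, mu2):
    # mu2 weights the entries belonging to L, mu1 those belonging to S.
    return mu2 * np.sum(theta[:d * k] ** 2) + mu1 * np.sum(theta[d * k:] ** 2)

L = np.arange(6.0).reshape(3, 2)
S = np.arange(4.0).reshape(2, 2)
theta = unroll(L, S)
L_back, S_back = roll(theta, 3, 2, 2)
```

On the unrolled vector the regularizer is just a weighted sum of squares, which is what makes the regularized-leader machinery applicable.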

4.  Online  Learning  Method 

We solve Eq. (7) in two steps. First, we determine the unconstrained solution $\tilde{\theta}_{r+1}$ when $\mathcal{K} = \mathbb{R}^{dk + k|\mathcal{T}|}$ (see Sect. 4.1). Given $\tilde{\theta}_{r+1}$, we derive the constrained solution $\hat{\theta}_{r+1}$ by learning a projection $\mathrm{Proj}_{\Omega_r, \mathcal{K}}\big(\tilde{\theta}_{r+1}\big)$ to the constraint set $\mathcal{K} \subseteq \mathbb{R}^{dk + k|\mathcal{T}|}$, which amounts to minimizing the Bregman divergence over $\Omega_r(\theta)$ (see Sect. 4.2)². The complete approach is given in Algorithm 1 and is available as a software implementation on the authors' websites.

4.1. Unconstrained Policy Solution

Although Eq. (6) is not jointly convex in both $L$ and $S$, it is separably convex (for log-concave policy distributions). Consequently, we follow an alternating optimization approach, first computing $L$ while holding $S$ fixed, and then updating $S$ given the acquired $L$. We detail this process for two popular PG learners, eREINFORCE (Williams, 1992) and eNAC (Peters & Schaal, 2008b). The derivations of the update rules below can be found in Appendix A.

These updates are governed by learning rates $\beta$ and $\lambda$ that decay over time; $\beta$ and $\lambda$ can be chosen using line-search methods as discussed by Boyd & Vandenberghe (2004). In our experiments, we adopt a simple yet effective strategy, where $\beta = c j^{-1}$ and $\lambda = c j^{-1}$, with $0 < c < 1$.

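Step 1: Updating $L$. Holding $S$ fixed, the latent repository can be updated according to:

$$L_{\beta+1} = L_\beta - \eta_L^\beta \nabla_L e_r(L, S) \qquad \text{(eREINFORCE)}$$

$$L_{\beta+1} = L_\beta - \eta_L^\beta G^{-1}(L_\beta, S_\beta) \nabla_L e_r(L, S) \qquad \text{(eNAC)}$$

with learning rate $\eta_L^\beta \in \mathbb{R}$, and $G^{-1}(L, S)$ as the inverse of the Fisher information matrix (Peters & Schaal, 2008b). In the special case of Gaussian policies, the update for $L$ can be derived in a closed form as $L_{\beta+1} = Z_L^{-1} v_L$, where

$$Z_L = 2\mu_2 I_{dk \times dk} + \sum_{j=1}^{r} \frac{\eta_{t_j}}{n_{t_j} \sigma_{t_j}^2} \sum_{k=1}^{n_{t_j}} \sum_{m=0}^{M_{t_j}-1} \mathrm{vec}\big(\Phi s_{t_j}^{\mathsf T}\big) \big(\Phi^{\mathsf T} \otimes s_{t_j}^{\mathsf T}\big),$$

$$v_L = \sum_{j=1}^{r} \frac{\eta_{t_j}}{n_{t_j} \sigma_{t_j}^2} \sum_{k=1}^{n_{t_j}} \sum_{m=0}^{M_{t_j}-1} \mathrm{vec}\big(u_m^{(k,t_j)} \Phi s_{t_j}^{\mathsf T}\big),$$

$\sigma_{t_j}^2$ is the covariance of the Gaussian policy for a task $t_j$, and $\Phi = \Phi\big(x_m^{(k,t_j)}\big)$ denotes the state features.

²In Sect. 4.2, we linearize the loss around the constrained solution of the previous round to increase stability and ensure convergence. Given the linear losses, it suffices to solve the Bregman divergence over the regularizer, reducing the computational cost.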


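Step 2: Updating S. Given the fixed basis L, the coefficient
matrix S is updated column-wise for all $t_j \in \mathcal{I}_r$:

$$s_{\lambda+1}^{(t_j)} = s_{\lambda}^{(t_j)} - \eta_S^{\lambda} \nabla_{s_{t_j}} e_r(L, S) \quad \text{(eREINFORCE)}$$

$$s_{\lambda+1}^{(t_j)} = s_{\lambda}^{(t_j)} - \eta_S^{\lambda} G^{-1}(L_\beta, S_\beta) \nabla_{s_{t_j}} e_r(L, S) \quad \text{(eNAC)}$$

with learning rate $\eta_S^{\lambda} \in \mathbb{R}$. For Gaussian policies, the
closed form of the update is $s_{t_j} = Z_{s_{t_j}}^{-1} v_{s_{t_j}}$, where

$$Z_{s_{t_j}} = 2\mu_1 I_{k \times k} + \sum_{k=1}^{n_{t_j}} \sum_{m=0}^{M_{t_j}-1} \frac{\eta_{t_j}}{n_{t_j}\sigma_{t_j}^2} L^{\mathsf T} \Phi \Phi^{\mathsf T} L,$$

$$v_{s_{t_j}} = \sum_{k=1}^{n_{t_j}} \sum_{m=0}^{M_{t_j}-1} \frac{\eta_{t_j}}{n_{t_j}\sigma_{t_j}^2} u_m^{(k,t_j)} L^{\mathsf T} \Phi.$$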

4.2.  Constrained  Policy  Solution 

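Once we have obtained the unconstrained solution $\tilde\theta_{r+1}$
(which satisfies Eq. (7), but can lead to policy parameters
in unsafe regions), we then derive the constrained
solution to ensure safe policies. We learn a projection
$\mathrm{Proj}_{\Omega_r,\mathcal{K}}(\tilde\theta_{r+1})$ from $\tilde\theta_{r+1}$ to the constraint set:

$$\hat\theta_{r+1} = \arg\min_{\theta \in \mathcal{K}} B_{\Omega_r,\mathcal{K}}(\theta, \tilde\theta_{r+1}), \quad (8)$$

where $B_{\Omega_r,\mathcal{K}}(\theta, \tilde\theta_{r+1})$ is the Bregman divergence over $\Omega_r$:

$$B_{\Omega_r,\mathcal{K}}(\theta, \tilde\theta_{r+1}) = \Omega_r(\theta) - \Omega_r(\tilde\theta_{r+1}) - \mathrm{trace}\left(\nabla_\theta \Omega_r(\theta)\big|_{\tilde\theta_{r+1}} (\theta - \tilde\theta_{r+1})\right).$$

Solving Eq. (8) is computationally expensive since $\Omega_r(\theta)$
includes the sum back to the original round. To remedy this
problem, ensure the stability of our approach, and guarantee
that the constrained solutions for all observed tasks
lie within a bounded region, we linearize the current-round
loss function $l_{t_r}(\theta)$ around the constrained solution of the
previous round $\hat\theta_r$:

$$l_{t_r}(\hat u) = \hat f_{t_r}\big|_{\hat\theta_r}^{\mathsf T} \hat u, \quad (9)$$

where

$$\hat f_{t_r}\big|_{\hat\theta_r} = \begin{bmatrix} \nabla_\theta l_{t_r}(\theta)\big|_{\hat\theta_r} \\ l_{t_r}(\theta)\big|_{\hat\theta_r} - \nabla_\theta l_{t_r}(\theta)\big|_{\hat\theta_r}^{\mathsf T} \hat\theta_r \end{bmatrix}, \qquad \hat u = \begin{bmatrix} u \\ 1 \end{bmatrix}.$$

Given the above linear form, we can rewrite the optimization
problem in Eq. (8) as:

$$\hat\theta_{r+1} = \arg\min_{\theta \in \mathcal{K}} B_{\Omega_0,\mathcal{K}}(\theta, \tilde\theta_{r+1}). \quad (10)$$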

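Consequently, determining safe policies for lifelong policy
search reinforcement learning amounts to solving:

$$\min_{L,S} \; \mu_1\|S\|_F^2 + \mu_2\|L\|_F^2 + 2\mu_1 \mathrm{trace}\left(S^{\mathsf T}\big|_{\tilde\theta_{r+1}} S\right) + 2\mu_2 \mathrm{trace}\left(L^{\mathsf T}\big|_{\tilde\theta_{r+1}} L\right)$$

$$\text{s.t.} \quad A_{t_j} L s_{t_j} \le b_{t_j} \quad \forall t_j \in \mathcal{I}_r$$

$$LL^{\mathsf T} \preceq pI \ \text{ and } \ LL^{\mathsf T} \succeq qI.$$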

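To solve the optimization problem above, we start by converting
the inequality constraints to equality constraints
by introducing slack variables $c_{t_j} \ge 0$. We also guarantee
that these slack variables are bounded by incorporating
$\|c_{t_j}\| \le c_{\max}$, $\forall t_j \in \{1, \dots, |T|\}$:

$$\min_{L,S,C} \; \mu_1\|S\|_F^2 + \mu_2\|L\|_F^2 + 2\mu_2 \mathrm{trace}\left(L^{\mathsf T}\big|_{\tilde\theta_{r+1}} L\right) + 2\mu_1 \mathrm{trace}\left(S^{\mathsf T}\big|_{\tilde\theta_{r+1}} S\right)$$

$$\text{s.t.} \quad A_{t_j} L s_{t_j} = b_{t_j} - c_{t_j} \quad \forall t_j \in \mathcal{I}_r$$

$$c_{t_j} > 0 \ \text{ and } \ \|c_{t_j}\|_2 \le c_{\max} \quad \forall t_j \in \mathcal{I}_r$$

$$LL^{\mathsf T} \preceq pI \ \text{ and } \ LL^{\mathsf T} \succeq qI.$$

With this formulation, learning $\mathrm{Proj}_{\Omega_0,\mathcal{K}}(\tilde\theta_{r+1})$ amounts
to solving second-order cone and semi-definite programs.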
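
To make the projection step concrete, the sketch below covers a deliberately simplified case and is not the second-order cone or semi-definite program used here. It assumes a squared Euclidean regularizer $\Omega(\theta) = \mu\|\theta\|_2^2$ and a single linear constraint $a^{\mathsf T}\theta \le b$. Under that assumption the Bregman divergence reduces to $\mu\|\theta - \tilde\theta\|_2^2$, so the Bregman projection coincides with the ordinary Euclidean projection onto a half-space. All function names are hypothetical.

```python
import numpy as np


def bregman_sq_norm(theta, theta_tilde, mu):
    """Bregman divergence of Omega(v) = mu * ||v||^2 between theta and theta_tilde."""
    omega = lambda v: mu * float(np.dot(v, v))
    grad_at_tilde = 2.0 * mu * theta_tilde  # gradient of Omega evaluated at theta_tilde
    return omega(theta) - omega(theta_tilde) - float(np.dot(grad_at_tilde, theta - theta_tilde))


def project_halfspace(theta_tilde, a, b):
    """Closest point (Euclidean) to theta_tilde in the set {theta : a @ theta <= b}."""
    violation = float(np.dot(a, theta_tilde)) - b
    if violation <= 0.0:
        return theta_tilde.copy()  # already feasible, nothing to project
    return theta_tilde - (violation / float(np.dot(a, a))) * a
```

For example, projecting the point (2, 1) onto the region where the first coordinate is at most 1 moves it to (1, 1), and the Bregman divergence between the two points equals $\mu$ times their squared distance.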

4.2.1.  Semi-Definite  Program  for  Learning  L 

This section determines the constrained projection of the
shared basis L given fixed S and C. We show that L can be
acquired efficiently, since this step can be relaxed to solving
a semi-definite program in $LL^{\mathsf T}$ (Boyd & Vandenberghe,
2004). To formulate the semi-definite program, note that

$$\mathrm{trace}\left(L^{\mathsf T}\big|_{\tilde\theta_{r+1}} L\right) = \sum_{i=1}^{k} l^{(i)\,\mathsf T}_{r+1}\big|_{\tilde\theta_{r+1}} l_i \le \sum_{i=1}^{k} \left\|l^{(i)}_{r+1}\big|_{\tilde\theta_{r+1}}\right\|_2 \|l_i\|_2 \le \sqrt{\sum_{i=1}^{k} \left\|l^{(i)}_{r+1}\big|_{\tilde\theta_{r+1}}\right\|_2^2}\,\sqrt{\sum_{i=1}^{k} \|l_i\|_2^2} = \left\|L\big|_{\tilde\theta_{r+1}}\right\|_F \sqrt{\mathrm{trace}\left(LL^{\mathsf T}\right)}.$$

From the constraint set, we recognize:

$$s_{t_j}^{\mathsf T} L^{\mathsf T} = (b_{t_j} - c_{t_j})^{\mathsf T} \left(A_{t_j}^{\dagger}\right)^{\mathsf T}$$

$$\implies s_{t_j}^{\mathsf T} L^{\mathsf T} L s_{t_j} = a_{t_j}^{\mathsf T} a_{t_j}, \quad \text{with } a_{t_j} = A_{t_j}^{\dagger}(b_{t_j} - c_{t_j}).$$

Algorithm 1 Safe Online Lifelong Policy Search
1: Inputs: Total number of rounds R, weighting factor
   $\eta = 1/\sqrt{R}$, regularization parameters $\mu_1$ and $\mu_2$,
   constraints p and q, number of latent basis vectors k.
2: S = zeros(k, |T|), L = diag_k($\zeta$) with $p \le \zeta^2 \le q$
3: for j = 1 to R do
4:    $t_j \leftarrow$ sampleTask(), and update $\mathcal{I}_j$
5:    Compute unconstrained solution $\tilde\theta_{j+1}$ (Sect. 4.1)
6:    Fix S and C, and update L (Sect. 4.2.1)
7:    Use updated L to derive S and C (Sect. 4.2.2)
8: end for
9: Output: Safety-constrained L and S
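
The control flow of Algorithm 1 can be sketched as follows. This is only a skeleton under stated assumptions: the three solver steps (the unconstrained update of Sect. 4.1 and the two projections of Sects. 4.2.1 and 4.2.2) are passed in as callables whose names and signatures are hypothetical, the slack variables C are folded into those callables, and `diag_k` is interpreted as a d-by-k matrix with $\zeta$ on its main diagonal.

```python
import numpy as np


def safe_lifelong_loop(num_rounds, num_tasks, d, k, zeta,
                       sample_task, unconstrained_step, project_L, project_S, seed=0):
    """Skeleton of the round structure; the solver callables are supplied by the user."""
    rng = np.random.default_rng(seed)
    L = zeta * np.eye(d, k)           # line 2: L = diag_k(zeta)
    S = np.zeros((k, num_tasks))      # line 2: S = zeros(k, |T|)
    observed = set()                  # tasks seen so far
    for j in range(1, num_rounds + 1):
        t = sample_task(rng, num_tasks)                     # line 4
        observed.add(t)
        L_tilde, S_tilde = unconstrained_step(L, S, t, j)   # line 5
        L = project_L(L_tilde, S_tilde, observed)           # line 6
        S = project_S(L, S_tilde, observed)                 # line 7
    return L, S, observed
```

Passing identity functions for the three solver steps leaves L and S at their initial values, which is a convenient way to check the bookkeeping before plugging in real solvers.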


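Since spectrum($LL^{\mathsf T}$) = spectrum($L^{\mathsf T}L$), we can write:

$$\min_{X} \; \mu_2 \mathrm{trace}(X) + 2\mu_2 \left\|L\big|_{\tilde\theta_{r+1}}\right\|_F \sqrt{\mathrm{trace}(X)}$$

$$\text{s.t.} \quad s_{t_j}^{\mathsf T} X s_{t_j} = a_{t_j}^{\mathsf T} a_{t_j} \quad \forall t_j \in \mathcal{I}_r$$

$$X \preceq pI \ \text{ and } \ X \succeq qI, \quad \text{with } X = L^{\mathsf T} L.$$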

4.2.2.  Second-Order  Cone  Program  for 
Learning  Task  Projections 

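Having determined L, we can acquire S and update C
by solving a second-order cone program (Boyd & Vandenberghe,
2004) of the following form:

$$\min_{s_{t_1},\dots,s_{t_r},\,c_{t_1},\dots,c_{t_r}} \; \mu_1 \sum_{j=1}^{r} \|s_{t_j}\|_2^2 + 2\mu_1 \sum_{j=1}^{r} s_{t_j}^{\mathsf T}\big|_{\tilde\theta_{r+1}} s_{t_j}$$

$$\text{s.t.} \quad A_{t_j} L s_{t_j} = b_{t_j} - c_{t_j}$$

$$c_{t_j} > 0, \quad \|c_{t_j}\|_2^2 \le c_{\max}^2 \quad \forall t_j \in \mathcal{I}_r.$$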


5. Theoretical Guarantees

This section quantifies the performance of our approach by
providing formal analysis of the regret after R rounds. We
show that the safe lifelong reinforcement learner exhibits
sublinear regret in the total number of rounds. Formally,
we prove the following theorem:

Theorem 1 (Sublinear Regret). After R rounds and choosing
$\forall t_j \in \mathcal{I}_R$ $\eta_{t_j} = \eta = 1/\sqrt{R}$, $L\big|_{\hat\theta_1} = \mathrm{diag}_k(\zeta)$, with
$\mathrm{diag}_k(\cdot)$ being a diagonal matrix among the k columns of
L, $p \le \zeta^2 \le q$, and $S\big|_{\hat\theta_1} = 0_{k \times |T|}$, the safe lifelong
reinforcement learner exhibits sublinear regret of the form:

$$\sum_{j=1}^{R} l_{t_j}(\hat\theta_j) - l_{t_j}(u) = O\left(\sqrt{R}\right) \quad \text{for any } u \in \mathcal{K}.$$

Proof Roadmap: The remainder of this section completes
our proof of Theorem 1; further details are given in Appendix B.
We assume linear losses for all tasks in the constrained
case in accordance with Sect. 4.2. Although linear
losses for policy search RL are too restrictive given a single
operating point, as discussed previously, we remedy this
problem by generalizing to the case of piece-wise linear
losses, where the linearization operating point is a resultant
of the optimization problem. To bound the regret, we need
to bound the dual Euclidean norm (which is the same as the
Euclidean norm) of the gradient of the loss function, then
prove Theorem 1 by bounding: (1) task $t_j$'s gradient loss
(Sect. 5.1), and (2) linearized losses with respect to L and
S (Sect. 5.2).

5.1. Bounding $t_j$'s Gradient Loss

We start by stating essential lemmas for Theorem 1; due to
space constraints, proofs for all lemmas are available in the
supplementary material. Here, we bound the gradient of a
loss function $l_{t_j}(\theta)$ at round r under Gaussian policies³.

Assumption 1. We assume that the policy for a task $t_j$ is
Gaussian, the action set $\mathcal{U}$ is bounded by $u_{\max}$, and the
feature set is upper-bounded by $\Phi_{\max}$.

Lemma 1. Assume task $t_j$'s policy at round r is given by

$$\pi^{(t_j)}_{\alpha_{t_j}}\left(u_m^{(k,t_j)} \mid x_m^{(k,t_j)}\right)\Big|_{\hat\theta_r} = \mathcal{N}\left(\alpha_{t_j}^{\mathsf T}\big|_{\hat\theta_r} \Phi\left(x_m^{(k,t_j)}\right), \sigma_{t_j}\right),$$

for states $x_m^{(k,t_j)} \in \mathcal{X}_{t_j}$ and actions $u_m^{(k,t_j)} \in \mathcal{U}_{t_j}$. For

$$l_{t_j}(\alpha_{t_j}) = -\frac{1}{n_{t_j}} \sum_{k=1}^{n_{t_j}} \sum_{m=0}^{M_{t_j}-1} \log\left[\pi^{(t_j)}_{\alpha_{t_j}}\left(u_m^{(k,t_j)} \mid x_m^{(k,t_j)}\right)\right],$$

the gradient satisfies

$$\left\|\nabla_{\alpha_{t_j}} l_{t_j}(\alpha_{t_j})\big|_{\hat\theta_r}\right\|_2 \le \frac{M_{t_j}}{\sigma_{t_j}^2} \left[\left(u_{\max} + \max_{t_k \in \mathcal{I}_{r-1}} \left\{\left\|A_{t_k}^{\dagger}\right\|_2 \left(\|b_{t_k}\|_2 + c_{\max}\right)\right\} \Phi_{\max}\right) \Phi_{\max}\right]$$

for all trajectories and all tasks, with $u_{\max} = \max_{k,m}\left\{\left|u_m^{(k,t_j)}\right|\right\}$
and $\Phi_{\max} = \max_{k,m}\left\{\left\|\Phi\left(x_m^{(k,t_j)}\right)\right\|_2\right\}$.

5.2. Bounding Linearized Losses

As discussed previously, we linearize the loss of task $t_r$
around the constraint solution of the previous round $\hat\theta_r$. To
acquire the regret bounds in Theorem 1, the next step is to
bound $\left\|\hat f_{t_r}\big|_{\hat\theta_r}\right\|_2$ of Eq. (9). It can be easily seen

$$\left\|\hat f_{t_r}\big|_{\hat\theta_r}\right\|_2 \le \left\|\nabla_\theta l_{t_r}(\theta)\big|_{\hat\theta_r}\right\|_2 + \left|l_{t_r}(\theta)\big|_{\hat\theta_r}\right| + \left\|\nabla_\theta l_{t_r}(\theta)\big|_{\hat\theta_r}\right\|_2 \times \left\|\hat\theta_r\right\|_2. \quad (11)$$

Since $\left|l_{t_r}(\theta)\big|_{\hat\theta_r}\right|$ can be bounded by $\delta_{l_{t_r}}$ (see Sect. 2),
the next step is to bound $\left\|\nabla_\theta l_{t_r}(\theta)\big|_{\hat\theta_r}\right\|_2$ and $\|\hat\theta_r\|_2$.

³Please note that derivations for other forms of log-concave
policy distributions could be derived in similar manner. In this
work, we focus on Gaussian policies since they cover a broad
spectrum of real-world applications.
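
The per-sample ingredient behind the Gaussian-policy gradient bound of Sect. 5.1 can be checked numerically. The sketch below assumes a scalar action with mean $\alpha^{\mathsf T}\Phi$ and standard deviation $\sigma$, for which the negative log-likelihood gradient is $-(u - \alpha^{\mathsf T}\Phi)\Phi/\sigma^2$. By the triangle and Cauchy-Schwarz inequalities its norm is at most $(u_{\max} + \|\alpha\|_2 \Phi_{\max})\Phi_{\max}/\sigma^2$. The function names are hypothetical, and the sum over time steps and trajectories is left out.

```python
import numpy as np


def neg_log_lik(alpha, phi, u, sigma):
    """Negative log-density of u under N(alpha @ phi, sigma^2)."""
    mean = float(alpha @ phi)
    return 0.5 * ((u - mean) / sigma) ** 2 + float(np.log(sigma * np.sqrt(2.0 * np.pi)))


def neg_log_lik_grad(alpha, phi, u, sigma):
    """Gradient of neg_log_lik with respect to alpha."""
    return -(u - float(alpha @ phi)) * phi / sigma ** 2


def grad_norm_bound(alpha_norm, u_max, phi_max, sigma):
    """Upper bound on the gradient norm from bounded actions, features and parameters."""
    return (u_max + alpha_norm * phi_max) * phi_max / sigma ** 2
```

Replacing `alpha_norm` by the constraint-based bound on the policy parameters and multiplying by the trajectory length gives a bound of the same shape as the one stated in Lemma 1.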


Lemma 2. The norm of the gradient of the loss function
evaluated at $\hat\theta_r$ satisfies
n  m2 


Vsltr(6 1)  <  Vfl[rilr(@) 


\q  x  d 


Ul»(* 

ifp  ,  { I K I  \l  O'6'*  i + c-0  }+i))  - 


To finalize the bound of $\left\|\hat f_{t_r}\big|_{\hat\theta_r}\right\|_2$ as needed for deriving
the regret, we must derive an upper-bound for $\|\hat\theta_r\|_2$:

Lemma 3. The $L_2$ norm of the constraint solution at round
$r - 1$, $\|\hat\theta_r\|_2^2$, is bounded by


IlSrlljSffXd 


1  +  12,,.^ 


if  ,{IKl£  (ii^n* +c™<)J} 

where $|\mathcal{I}_{r-1}|$ is the number of unique tasks observed so far.


max 

i 


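Given the previous two lemmas, we can prove the bound
for $\left\|\hat f_{t_r}\big|_{\hat\theta_r}\right\|_2$:

Lemma 4. The norm of the linearizing term of $l_{t_r}(\theta)$
around $\hat\theta_r$, $\left\|\hat f_{t_r}\big|_{\hat\theta_r}\right\|_2$, is bounded by

$$\left\|\hat f_{t_r}\big|_{\hat\theta_r}\right\|_2 \le \left\|\nabla_\theta l_{t_r}(\theta)\big|_{\hat\theta_r}\right\|_2 \left(1 + \|\hat\theta_r\|_2\right) + \left|l_{t_r}(\theta)\big|_{\hat\theta_r}\right| \le \gamma_1(r)\left(1 + \gamma_2(r)\right) + \delta_{l_{t_r}}, \quad (12)$$

where $\delta_{l_{t_r}}$ is the constant upper-bound on $\left|l_{t_r}(\theta)\big|_{\hat\theta_r}\right|$, and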

7iM  =  — f  tW 

n>wl  LV 

"1  |I;:':lX  {  I  T  ^1 1 2  (H6.JI.  +  C  )  }  'l*1''  i 

x  (i!  v'f^q\l'lnM  ^  m + c™^} + mTj 

TaW  <  y/q  x  d 


max  ^  max 


+  yf%Z 


5.3. Completing the Proof of Sublinear Regret

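Given the lemmas in the previous section, we now can derive
the sublinear regret bound given in Theorem 1. Using
results developed by Abbasi-Yadkori et al. (2013), it is easy
to see that

$$\nabla_\theta \Omega_0(\tilde\theta_j) - \nabla_\theta \Omega_0(\tilde\theta_{j+1}) = \eta_{t_j} \hat f_{t_j}\big|_{\hat\theta_j}.$$

From the convexity of the regularizer, we obtain:

$$\Omega_0(\hat\theta_j) \ge \Omega_0(\hat\theta_{j+1}) + \left\langle \nabla_\theta \Omega_0(\hat\theta_{j+1}), \hat\theta_j - \hat\theta_{j+1} \right\rangle + \frac{1}{2}\left\|\hat\theta_j - \hat\theta_{j+1}\right\|_2^2.$$

We have:

$$\left\|\hat\theta_j - \hat\theta_{j+1}\right\|_2 \le \eta_{t_j} \left\|\hat f_{t_j}\big|_{\hat\theta_j}\right\|_2.$$

Therefore, for any $u \in \mathcal{K}$

$$\sum_{j=1}^{r} \eta_{t_j}\left(l_{t_j}(\hat\theta_j) - l_{t_j}(u)\right) \le \sum_{j=1}^{r} \eta_{t_j}^2 \left\|\hat f_{t_j}\big|_{\hat\theta_j}\right\|_2^2 + \left(\Omega_0(u) - \Omega_0(\hat\theta_1)\right).$$

Assuming that $\forall t_j$ $\eta_{t_j} = \eta$, we can derive:

$$\sum_{j=1}^{r} \left(l_{t_j}(\hat\theta_j) - l_{t_j}(u)\right) \le \eta \sum_{j=1}^{r} \left\|\hat f_{t_j}\big|_{\hat\theta_j}\right\|_2^2 + \frac{1}{\eta}\left(\Omega_0(u) - \Omega_0(\hat\theta_1)\right).$$

The following lemma finalizes the proof of Theorem 1:

Lemma 5. After R rounds with $\forall t_j$ $\eta_{t_j} = \eta = 1/\sqrt{R}$, for any
$u \in \mathcal{K}$ we have that $\sum_{j=1}^{R} l_{t_j}(\hat\theta_j) - l_{t_j}(u) \le O\left(\sqrt{R}\right)$.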

Proof. From (12), it follows that

i,2 


A L 


<7:<(JR)  +  47?(Rh!(R) 


<  rA&)  +  s^7r(tfk<q  i  +  |z*-i| 


'l1 


X(«^_i{||<||2(||6(J|i  +  Ciml)2}j 

with  73  (ij)  =  4-  2max^ejn_1  Sf^.  Since 


\Tr-i\  <  |T|,  we  have  that  \\ftj 


\9r 


<  75(i?)|T|  with 


\Zr- 1 1  ~ j  ^A\t  j  |  J}\btk  1 12  +  dffjar)  J  ■ 


75  =  Sd/p2<nl(R)  t  (  (llAjjll  (||6*  J2  +  Cm**)2}. 

Given  that  rio(“)  S  +  7s(H)|T|t  with  7s(/7!)  being  a 
constant,  we  have: 

(f) <  ‘iYW.R)\T\ 

J=1  j-1 

+  i(9d+7S(B)in-n0(«,)) 

Initializing L and S: We initialize $L\big|_{\hat\theta_1} = \mathrm{diag}_k(\zeta)$, with
$p \le \zeta^2 \le q$ and $S\big|_{\hat\theta_1} = 0_{k \times |T|}$ to ensure the invertibility
of L and that the constraints are met. This leads to

j=l  j=l 

+  +  TftW|71-w*0  ■ 

Choosing $\forall t_j$ $\eta_{t_j} = \eta = 1/\sqrt{R}$, we acquire sublinear regret,
finalizing the statement of Theorem 1:

r 

V(M^) -'«/“>)  - 

j=i 

+  v^(g(/  +  76{H)|T|-MfcO 

<  VR^mri  +  qdl5{R)\T\  -  U2k<) 

<  o  (Zfl)  .  □ 
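
To illustrate what a regret of order $\sqrt{R}$ looks like in practice, the sketch below runs projected online gradient descent on linear losses with the step size $\eta = 1/\sqrt{R}$. This is a textbook stand-in, not the safe lifelong learner analyzed above: the constraint set is assumed to be a Euclidean ball, and the function name is hypothetical. For this simpler learner, the standard analysis bounds the regret by $\rho^2/(2\eta) + (\eta/2)\sum_j \|g_j\|_2^2$, where $\rho$ is the ball radius and $g_j$ the loss gradients, which grows as $\sqrt{R}$ for bounded gradients.

```python
import numpy as np


def ogd_regret(gradients, radius, eta):
    """Regret of projected online gradient descent on linear losses l_j(theta) = g_j @ theta."""
    theta = np.zeros(gradients.shape[1])  # start at the centre of the ball
    learner_loss = 0.0
    for g in gradients:
        learner_loss += float(g @ theta)   # suffer the loss of the current iterate
        theta = theta - eta * g            # gradient step
        norm = np.linalg.norm(theta)
        if norm > radius:                  # project back onto the ball
            theta = theta * (radius / norm)
    total = gradients.sum(axis=0)
    best_fixed_loss = -radius * float(np.linalg.norm(total))  # best comparator in hindsight
    return learner_loss - best_fixed_loss
```

Dividing the returned regret by the number of rounds gives the average regret, which shrinks as R grows when the gradients are bounded.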

6.  Experimental  Validation 

To validate the empirical performance of our method, we
applied our safe online PG algorithm to learn multiple consecutive
control tasks on three dynamical systems (Figure 1).
To generate multiple tasks, we varied the parameterization
of each system, yielding a set of control tasks from
each domain with varying dynamics. The optimal control
policies for these systems vary widely with only minor
changes in the system parameters, providing substantial
diversity among the tasks within a single domain.


Figure 1. Dynamical systems used in the experiments: a) simple
mass system (left), b) cart-pole (middle), and c) quadrotor
unmanned aerial vehicle (right).

Simple Mass Spring Damper: The simple mass (SM)
system is characterized by three parameters: the spring constant
k in N/m, the damping constant d in Ns/m and the
mass m in kg. The system's state is given by the position $x$
and velocity $\dot x$ of the mass, which varies according to a linear force
F. The goal is to train a policy for controlling the mass in
a specific state $g_{\mathrm{ref}} = \{x_{\mathrm{ref}}, \dot x_{\mathrm{ref}}\}$.
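
A minimal simulation of such a system can be sketched as follows. The equation of motion $m\ddot x = F - kx - d\dot x$, the explicit Euler integrator, the time step, and the linear feedback policy with hypothetical `gains` are all assumptions made for illustration. The simulator used in the experiments is not specified at this level of detail.

```python
import numpy as np


def simulate_simple_mass(k, d, m, gains, x0, x_ref, steps=150, dt=0.01):
    """Roll out a mass-spring-damper under the linear policy F = gains @ (x_ref - state)."""
    x, v = x0
    trajectory = np.empty((steps + 1, 2))
    trajectory[0] = (x, v)
    for i in range(steps):
        force = gains[0] * (x_ref[0] - x) + gains[1] * (x_ref[1] - v)
        acc = (force - k * x - d * v) / m    # assumed dynamics: m * x'' = F - k * x - d * x'
        x, v = x + dt * v, v + dt * acc      # explicit Euler step using the old state
        trajectory[i + 1] = (x, v)
    return trajectory
```

Varying `k`, `d`, and `m` across calls yields a family of related tasks whose best feedback gains differ, which is the kind of task diversity the experiments rely on.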

Cart Pole: The cart-pole (CP) has been used extensively
as a benchmark for evaluating RL methods (Busoniu et al.,
2010). CP dynamics are characterized by the cart's mass
$m_c$ in kg, the pole's mass $m_p$ in kg, the pole's length in
meters, and a damping parameter d in Ns/m. The state is
given by the cart's position $x$ and velocity $\dot x$, as well as the
pole's angle $\theta$ and angular velocity $\dot\theta$. The goal is to train a
policy that controls the pole in an upright position.

6.1. Experimental Protocol

We generated 10 tasks for each domain by varying the system
parameters to ensure a variety of tasks with diverse optimal
policies, including those with highly chaotic dynamics
that are difficult to control. We ran each experiment for
a total of R rounds, varying from 150 for the simple mass
to 10,000 for the quadrotor to train L and S, as well as
for updating the PG-ELLA and PG models. At each round
j, the learner observed a task $t_j$ through 50 trajectories of
150 steps and updated L and $s_{t_j}$. The dimensionality k of
the latent space was chosen independently for each domain
via cross-validation over 3 tasks, and the learning step size
for each task domain was determined by a line search after
gathering 10 trajectories of length 150. We used eNAC, a
standard PG algorithm, as the base learner.

We compared our approach to both standard PG (i.e.,
eNAC) and PG-ELLA (Bou Ammar et al., 2014), examining
both the constrained and unconstrained variants of our
algorithm. We also varied the number of iterations in our
alternating optimization from 10 to 100 to evaluate the effect
of these inner iterations on the performance, as shown in
Figures 2 and 3. For the two MTL algorithms (our approach
and PG-ELLA), the policy parameters for each task $t_j$ were
initialized using the learned basis (i.e., $\alpha_{t_j} = L s_{t_j}$). We
configured PG-ELLA as described by Bou Ammar et al.
(2014), ensuring a fair comparison. For the standard PG
learner, we provided additional trajectories in order to ensure
a fair comparison, as described below.

For the experiments with policy constraints, we generated
a set of constraints $(A_t, b_t)$ for each task that restricted the
policy parameters to pre-specified "safe" regions, as shown
in Figures 2(c) and 2(d). We also tested different values for
the constraints on L, varying p and q between 0.1 and 10;
our approach showed robustness against this broad range,
yielding similar average cost performance.

6.2. Results on Benchmark Systems

Figure 2 reports our results on the benchmark simple mass
and cart-pole systems. Figures 2(a) and 2(b) depict the
performance of the learned policy in a lifelong learning setting
over consecutive unconstrained tasks, averaged over
all 10 systems over 100 different initial conditions. These
results demonstrate that our approach is capable of outperforming
both standard PG (which was provided with 50
additional trajectories each iteration to ensure a fairer
comparison) and PG-ELLA, both in terms of initial performance
and learning speed. These figures also show that the
performance of our method increases as it is given more
alternating iterations per round for fitting L and S.

We evaluated the ability of these methods to respect safety
constraints, as shown in Figures 2(c) and 2(d). The thicker
black lines in each figure depict the allowable "safe" region
of the policy space. To enable online learning per-task, the
same task $t_j$ was observed on each round and the shared
basis L and coefficients $s_{t_j}$ were updated using alternating
optimization. We then plotted the change in the policy parameter
vectors per iteration (i.e., $\alpha_{t_j} = L s_{t_j}$) for each
method, demonstrating that our approach abides by the
safety constraints, while standard PG and PG-ELLA can
violate them (since they only solve an unconstrained optimization
problem). In addition, these figures show that increasing
the number of alternating iterations in our method
causes it to take a more direct path to the optimal solution.

[Figure 2 panels: (a) Simple Mass, (b) Cart Pole, (c) Trajectory Simple Mass, (d) Trajectory Cart Pole]

Figure 2. Results on benchmark simple mass and cart-pole systems. Figures (a) and (b) depict performance in lifelong learning scenarios
over consecutive unconstrained tasks, showing that our approach outperforms standard PG and PG-ELLA. Figures (c) and (d) examine
the ability of these methods to abide by safety constraints on sample constrained tasks, depicting two dimensions of the policy space ($\alpha_1$
vs $\alpha_2$) and demonstrating that our approach abides by the constraints (the dashed black region).

6.3. Application to Quadrotor Control

We also applied our approach to the more challenging domain
of quadrotor control. The dynamics of the quadrotor
system (Figure 1) are influenced by inertial constants
around $e_{1,B}$, $e_{2,B}$, and $e_{3,B}$, thrust factors influencing how
the rotor's speed affects the overall variation of the system's
state, and the lengths of the rods supporting the rotors.
Although the overall state of the system can be described by
a 12-dimensional vector, we focus on stability and so consider
only six of these state variables. The quadrotor system
has a high-dimensional action space, where the goal is
to control the four rotational velocities of the rotors
to stabilize the system. To ensure realistic dynamics,
we used the simulated model described by (Bouabdallah,
2007; Voos & Bou Ammar, 2010), which has been verified
and used in the control of physical quadrotors.

We  generated  10  different  quadrotor  systems  by  varying 
the  inertia  around  the  x,  y  and  z-axes.  We  used  a  linear 
quadratic  regulator,  as  described  by  Bouabdallah  (2007), 
to  initialize  the  policies  in  both  the  learning  and  testing 
phases.  We  followed  a  similar  experimental  procedure  to 
that  discussed  above  to  update  the  models. 

Figure 3 shows the performance of the unconstrained solution
as compared to standard PG and PG-ELLA. Again, our
approach clearly outperforms standard PG and PG-ELLA
in both the initial performance and learning speed. We
also evaluated constrained tasks in a similar manner, again
showing that our approach is capable of respecting constraints.
Since the policy space is higher dimensional, we
cannot visualize it as well as the benchmark systems, and so
instead report the number of iterations it takes our approach
to project the policy into the safe region. Figure 4 shows
that our approach requires only one observation of the task
to acquire safe policies, which is substantially lower than
standard PG or PG-ELLA (e.g., which require 545 and 510
observations, respectively, in the quadrotor scenario).

Figure 3. Performance on quadrotor control.

Figure 4. Average number of task observations before acquiring
policy parameters that abide by the constraints, showing that our
approach immediately projects policies to safe regions.

7.  Conclusion 

We described the first lifelong PG learner that provides sublinear
regret $O(\sqrt{R})$ with R total rounds. In addition, our
approach supports safety constraints on the learned policy,
which are essential for robust learning in real applications.
Our framework formalizes lifelong learning as online MTL
with limited resources, and enables safe transfer by sharing
policy parameters through a latent knowledge base that is
efficiently updated over time.

Abstract

Online multitask learning is an important capability
for lifelong learning agents, enabling them
to acquire models for diverse tasks over time and
rapidly learn new tasks by building upon prior experience.
However, recent progress toward lifelong
reinforcement learning (RL) has been limited to
learning from within a single task domain. For truly
versatile lifelong learning, the agent must be able to
autonomously transfer knowledge between different
task domains. A few methods for cross-domain
transfer have been developed, but these methods are
computationally inefficient for scenarios where the
agent must learn tasks consecutively.

In this paper, we develop the first cross-domain lifelong
RL framework. Our approach efficiently optimizes
a shared repository of transferable knowledge
and learns projection matrices that specialize
that knowledge to different task domains. We provide
rigorous theoretical guarantees on the stability
of this approach, and empirically evaluate its performance
on diverse dynamical systems. Our results
show that the proposed method can learn effectively
from interleaved task domains and rapidly
acquire high performance in new domains.

1  Introduction 

Reinforcement learning (RL) provides the ability to solve
high-dimensional control problems when detailed knowledge
of the system is not available a priori. Applications with these
characteristics are ubiquitous in a variety of domains, from
robotic control [Busoniu et al., 2008; Smart & Kaelbling,
2002] to stock trading [Dempster & Leemans, 2006]. However,
in many cases, RL methods require numerous lengthy
interactions with the dynamical system in order to learn an
acceptable controller. Unfortunately, the cost of acquiring
these interactions is often prohibitively expensive (in terms
of time, expense, physical wear on the robot, etc.).

This issue of obtaining adequate experience only worsens
when multiple control problems must be solved. This
need arises in two main cases: (1) when a single agent must
learn to solve multiple tasks (e.g., a reconfigurable robot), and
(2) when different robots must each learn to solve a single
control problem. Transfer learning [Taylor & Stone, 2009]
and multi-task learning (MTL) methods [Li et al., 2009;
Lazaric & Ghavamzadeh, 2010] have been developed to reduce
the amount of experience needed for individual tasks by
allowing the agent to reuse knowledge from other tasks.

In both transfer learning and MTL, the difficulty of transferring
knowledge between tasks increases with the diversity
of the tasks. In the extreme case, the underlying systems
(and their action and state representations) are entirely different,
making direct knowledge reuse impossible. Several
approaches to this cross-domain transfer problem have been
developed, but these methods require an inter-task mapping
of state/action spaces that is either hand-coded [Taylor et al.,
2007] or learned in a computationally inefficient manner that
does not scale to more than a few task domains (see Section 3).
However, the problem of cross-domain transfer has
not yet been studied in lifelong learning settings [Thrun &
O'Sullivan, 1996; Ruvolo & Eaton, 2013], in which the agent
must learn multiple tasks consecutively with the goal of optimizing
performance across all previously learned tasks.

We address this problem by developing the first algorithm
for lifelong RL that supports efficient and autonomous cross-domain
transfer between multiple consecutive tasks from different
domains. Specifically, our approach provides the following
advantages over existing approaches: (1) it improves
current cross-domain transfer learning methods by optimizing
performance across all tasks, (2) it learns multi-task cross-domain
mappings autonomously, and (3) it can share knowledge
between a multitude of tasks from diverse domains in a
computationally efficient manner. To enable effective cross-domain
lifelong learning, our approach learns a repository of
shared knowledge along with projection matrices that specialize
this shared knowledge to each task domain. We provide
theoretical guarantees on convergence that show this approach
becomes increasingly stable as the number of domains
or tasks grows large, and demonstrate the empirical effectiveness
of cross-domain lifelong learning between streams of interleaved
tasks from diverse dynamical systems, including bicycles
and helicopters.

2  Background  on  Policy  Gradient  RL 

In a reinforcement learning (RL) problem, an agent must decide
how to sequentially select actions to maximize its expected
return. In contrast to classic stochastic optimal control
methods [Bertsekas, 1995], RL approaches do not require
detailed prior knowledge of the system dynamics or
goal. Instead these approaches learn optimal control policies
through interaction with the system itself. RL problems
are typically formalized as a Markov decision process
(MDP) $\langle \mathcal{X}, \mathcal{A}, P, R, \gamma \rangle$, where $\mathcal{X} \subseteq \mathbb{R}^d$ is the (potentially
infinite) set of states, $\mathcal{A}$ is the set of actions that the agent
may execute, $P : \mathcal{X} \times \mathcal{A} \times \mathcal{X} \to [0, 1]$ is a state transition
probability function describing the task dynamics, $R :
\mathcal{X} \times \mathcal{A} \times \mathcal{X} \to \mathbb{R}$ is the reward function measuring the performance
of the agent, and $\gamma \in [0, 1)$ is the reward discount
factor. At each time step m, the agent is in state $x_m \in \mathcal{X}$ and
must choose an action $a_m \in \mathcal{A}$, transitioning it to a new
state $x_{m+1} \sim P(x_{m+1} \mid x_m, a_m)$ and yielding a reward
$r_{m+1} = R(x_m, a_m, x_{m+1})$. The sequence of state-action
pairs forms a trajectory $\tau = [x_{1:M}, a_{1:M}]$ over a (possibly
infinite) horizon M. A policy $\pi : \mathcal{X} \times \mathcal{A} \to [0, 1]$ specifies
a conditional probability distribution over actions given the
current state. The RL agent's goal is to find a policy $\pi^*$ that
maximizes the expected per-time-step reward.

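Policy gradient methods [Peters et al., 2005; Sutton et al.,
1999] represent the agent's policy $\pi$ as a function defined over
a vector $\theta \in \mathbb{R}^d$ of control parameters. With this parameterized
policy, we can compute the optimal parameters $\theta^*$ that
maximize the expected average reward:

$$J(\theta) = \mathbb{E}\left[\frac{1}{M} \sum_{m=1}^{M} r_{m+1}\right] = \int_{\mathbb{T}} p_\theta(\tau) \mathfrak{R}(\tau)\, d\tau, \quad (1)$$

where $\mathbb{T}$ is the set of all possible trajectories,
$\mathfrak{R}(\tau) = \frac{1}{M} \sum_{m=1}^{M} r_{m+1}$ is the reward of trajectory $\tau$, and

$$p_\theta(\tau) = P_0(x_1) \prod_{m=1}^{M} P(x_{m+1} \mid x_m, a_m)\, \pi_\theta(a_m \mid x_m)$$

is the probability of $\tau$ with initial state distribution
$P_0 : \mathcal{X} \to [0, 1]$.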
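
The expectation in Eq. (1) is usually approximated by sampling. The sketch below estimates $J(\theta)$ by averaging the per-time-step reward over rollouts of a one-dimensional toy system. The linear dynamics, the quadratic reward, the Gaussian policy with mean $\theta x$, and all numerical values are assumptions chosen only to keep the example self-contained.

```python
import numpy as np


def rollout_reward(theta, rng, horizon=50, sigma=0.1):
    """Average reward R(tau) = (1/M) * sum of r_{m+1} along one sampled trajectory."""
    x = rng.normal(1.0, 0.1)                  # x_1 drawn from the initial state distribution
    total = 0.0
    for _ in range(horizon):
        a = rng.normal(theta * x, sigma)      # a_m drawn from the Gaussian policy
        x_next = 0.9 * x + 0.5 * a + rng.normal(0.0, 0.01)  # assumed toy dynamics
        total += -(x_next ** 2)               # assumed reward: penalize distance from zero
        x = x_next
    return total / horizon


def estimate_J(theta, n_traj=200, seed=0):
    """Monte Carlo estimate of the expected average reward J(theta)."""
    rng = np.random.default_rng(seed)
    return float(np.mean([rollout_reward(theta, rng) for _ in range(n_traj)]))
```

Comparing `estimate_J` for different values of `theta` shows how strongly the objective depends on whether the feedback gain stabilizes the system.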

To maximize $J(\cdot)$, most policy gradient algorithms (e.g.,
episodic REINFORCE [Williams, 1992], PoWER [Rückstieß
et al., 2008], and Natural Actor Critic [Peters & Schaal,
2008]) employ standard supervised function approximation
to learn $\theta$ by maximizing a lower bound on $J(\theta)$. To maximize
this lower bound, these methods generate trajectories
using the current policy $\pi_\theta$, and then compare the result with
a new policy $\pi_{\tilde\theta}$. Kober & Peters [2011] describe how this
lower bound on the expected return can be attained using
Jensen's inequality and the concavity of the logarithm:

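$$\log J(\tilde\theta) = \log \int_{\mathbb{T}} p_{\tilde\theta}(\tau) \mathfrak{R}(\tau)\, d\tau = \log \int_{\mathbb{T}} \frac{p_\theta(\tau)}{p_\theta(\tau)}\, p_{\tilde\theta}(\tau) \mathfrak{R}(\tau)\, d\tau$$

$$\ge \int_{\mathbb{T}} p_\theta(\tau) \mathfrak{R}(\tau) \log \frac{p_{\tilde\theta}(\tau)}{p_\theta(\tau)}\, d\tau + \text{constant}$$

$$\propto -D_{\mathrm{KL}}\left(p_\theta(\tau) \mathfrak{R}(\tau) \,\|\, p_{\tilde\theta}(\tau)\right) = J_{L,\theta}(\tilde\theta),$$

where $D_{\mathrm{KL}}(p(\tau) \,\|\, q(\tau)) = \int_{\mathbb{T}} p(\tau) \log \frac{p(\tau)}{q(\tau)}\, d\tau$. Consequently,
this process is equivalent to minimizing the KL divergence
$D_{\mathrm{KL}}$ between $\pi_\theta$'s reward-weighted trajectory distribution
and the trajectory distribution $p_{\tilde\theta}$ of $\pi_{\tilde\theta}$.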
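
The Jensen step can be verified numerically on a finite set of trajectories. In the sketch below, `p_old` and `p_new` stand for the trajectory probabilities under $\pi_\theta$ and $\pi_{\tilde\theta}$, and `rewards` for $\mathfrak{R}(\tau)$. The rewards are assumed strictly positive so that the reward-weighted distribution is well defined, and the constant is written out explicitly as $\log J(\theta)$. The function names are hypothetical.

```python
import numpy as np


def log_return(p, rewards):
    """log J for a discrete trajectory distribution p with per-trajectory rewards."""
    return float(np.log(np.sum(p * rewards)))


def jensen_lower_bound(p_old, p_new, rewards):
    """Lower bound on log J(new) built from the reward-weighted old distribution."""
    J_old = float(np.sum(p_old * rewards))
    weights = p_old * rewards / J_old         # reward-weighted distribution, sums to one
    return float(np.log(J_old) + np.sum(weights * np.log(p_new / p_old)))
```

The bound is tight when the two distributions coincide, and raising it pulls `p_new` toward the reward-weighted version of `p_old`, which is the KL minimization described above.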


Policy gradient methods have gained attention in the RL
community in part due to their successful applications to
robotics [Peters et al., 2005]. While such methods have a low
computational cost per update, high-dimensional problems
require many updates (by acquiring new rollouts) to achieve
good performance. Transfer learning and multi-task learning
can reduce these data requirements and accelerate learning.

3  Related  Work  on  RL  Knowledge  Transfer 

Knowledge transfer between tasks has been explored in the
context of transfer learning and multi-task learning. In
all cases, each task t is described by an MDP $Z^{(t)} =
\langle \mathcal{X}^{(t)}, \mathcal{A}^{(t)}, P^{(t)}, R^{(t)}, \gamma^{(t)} \rangle$.

Transfer learning aims to improve the learning time and/or
behavior of the agent on a new target task by transferring
knowledge learned from one or more previous source tasks
[Taylor & Stone, 2009]. However, transfer learning methods
only focus on optimizing performance on the target task, and
therefore are not ideal for agents that revisit earlier tasks. In
contrast, multi-task learning (MTL) methods [Li et al., 2009;
Lazaric & Ghavamzadeh, 2010] optimize performance over
all tasks, often by training task models simultaneously while
sharing knowledge between tasks, but are computationally
expensive in lifelong learning scenarios where the agent must
learn tasks consecutively over time [Ruvolo & Eaton, 2013;
Thrun & O'Sullivan, 1996], as we explore in this paper.

One exception is PG-ELLA [Bou Ammar et al., 2014],
a recent lifelong policy gradient RL algorithm that can efficiently
learn multiple tasks consecutively while sharing
knowledge between task policies to accelerate learning. In
fact, PG-ELLA can be viewed as a special case of the algorithm
we propose in this paper where learning is limited to
only tasks from a single domain. MTL for policy gradients
has also been explored by Deisenroth et al. [2014] through
customizing a single parameterized controller to individual
tasks that differ only in the reward function. Another closely
related work is on hierarchical Bayesian MTL [Wilson et al.,
2007], which can learn RL tasks consecutively, but unlike
our approach, requires discrete states and actions. Snel and
Whiteson's [2014] representation learning approach is also
related, but assumes all tasks share the same feature and action
sets. All of these MTL methods operate on tasks from a
single domain, and do not support cross-domain transfer.

To enable transfer between tasks with different state and/or
action spaces, transfer learning and MTL methods require an
inter-task mapping to translate knowledge between tasks. The
earliest work on cross-domain transfer in RL, by Taylor et al.,
required a hand-coded mapping [2007] or a computationally
expensive exploration of all permitted mappings [2008].
More recently, unsupervised techniques have been developed
to autonomously learn inter-task mappings [Bou Ammar et
al., 2012; 2013]. While these approaches enable autonomous
cross-domain transfer, they only learn pairwise mappings
between tasks and are computationally expensive, making them
inapplicable for transfer among numerous tasks from different
domains. In contrast to these methods, our approach provides
a computationally efficient mechanism for transfer between
multiple task domains, enabling cross-domain lifelong
RL. Cross-domain MTL has also been explored in a limited
fashion in supervised settings [Han et al., 2012].

4  Problem  Definition 

Previous work on policy gradients has focused on either
single-task learning or MTL within a single task domain (i.e.,
all tasks share a common state and action space, but may differ
in other aspects). We focus on the problem of learning
multiple tasks consecutively, where the tasks may be drawn
from different task domains. Specifically, the agent must
learn a series of RL tasks $\mathcal{Z}^{(1)}, \ldots, \mathcal{Z}^{(T_{\max})}$ over its lifetime,
where each task is an MDP and the tasks may have different
state and/or action spaces. We can partition the series
of RL tasks into task groups $\mathcal{G}^{(1)}, \ldots, \mathcal{G}^{(G_{\max})}$, such that
all tasks within a particular group $\mathcal{G}^{(g)}$ (i.e., a set of tasks)
share a common state and action space, and are generated by
varying latent parameters [Konidaris & Doshi-Velez, 2014].
In our experiments, each task group corresponds to a task
domain, i.e., a class of dynamical systems such as helicopters
or cart-poles, each of which contains multiple tasks
corresponding to multiple physical helicopters or cart-poles.
However, this framework can easily generalize so that a particular
group $\mathcal{G}^{(g)}$ could represent a subset of tasks from a domain,
similar to task clustering frameworks [Kang et al., 2011].

The agent must learn the tasks consecutively, acquiring
multiple trajectories within each task before moving to the
next. The tasks may be interleaved, offering the agent the
opportunity to revisit earlier tasks (or task domains) for further
experience, but the agent has no control over the task order.
We assume that a priori the agent does not know the total
number of tasks $T_{\max}$, their distribution, the task order, or the
total number of task groups $G_{\max}$. The agent also has no prior
knowledge about the inter-task mappings between tasks, and
so it must also learn how to transfer knowledge between task
domains in order to optimize overall performance.

The agent's goal is to learn a set of optimal policies
$\Pi^* = \{\pi^*_{\theta^{(1)}}, \ldots, \pi^*_{\theta^{(T_{\max})}}\}$ with corresponding parameters
$\Theta^* = \{\theta^{(1)*}, \ldots, \theta^{(T_{\max})*}\}$. Since tasks belong to different
domains, the dimension of the parameter vectors will vary,
with $\theta^{(t)} \in \mathbb{R}^{d^{(t)}}$, where $d^{(t)}$ is the dimension of the state
space $\mathcal{X}^{(t)}$. At any time, the agent may be evaluated on any
previous task, and so must strive to optimize its learned
policies for all tasks $\mathcal{Z}^{(1)}, \ldots, \mathcal{Z}^{(T)}$, where $T = \sum_{g=1}^{G} |\mathcal{G}^{(g)}|$
denotes the number of tasks seen so far ($1 \le T \le T_{\max}$) and
$G$ is the number of groups seen so far ($1 \le G \le G_{\max}$).

5  Cross-Domain  Lifelong  RL 

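This section develops our cross-domain lifelong RL approach.
In order to share knowledge between the tasks, we assume
that each task's policy parameters $\theta^{(t)} \in \mathbb{R}^{d^{(t)}}$ for task
$t \in \mathcal{G}^{(g)}$ can be modeled as a sparse linear combination of
latent components from knowledge base $B^{(\mathcal{G}^{(g)})} \in \mathbb{R}^{d^{(t)} \times k}$ that
is shared among all tasks in the group. Therefore, we have
that $\theta^{(t)} = B^{(\mathcal{G}^{(g)})} s^{(t)}$, with sparse task-specific coefficients
$s^{(t)} \in \mathbb{R}^{k}$ for task $t$. The collection of all task-specific
coefficients for tasks in $\mathcal{G}^{(g)}$ is given by $S^{(\mathcal{G}^{(g)})} \in \mathbb{R}^{k \times |\mathcal{G}^{(g)}|}$.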


Figure 1: The knowledge framework, showing the shared
repository of transferable knowledge $L$, group projections
$\Psi$ that specialize $L$ to each task group, task-specific
coefficients over the specialized basis, and task groups.


Effectively, $B^{(\mathcal{G}^{(g)})}$ forms a $k$-component basis over the
policy parameters for all tasks in $\mathcal{G}^{(g)}$, enabling the transfer
of knowledge between tasks from this group. The task-specific
coefficients $s^{(t)}$ are encouraged to be sparse to ensure
that each learned basis component captures a maximal
reusable chunk of knowledge. This knowledge framework
has been used successfully by a number of other MTL
methods [Kumar & Daumé III, 2012; Maurer et al., 2013;
Ruvolo & Eaton, 2013] for transfer between tasks within a
single task domain.

To support cross-domain transfer, we introduce a repository
of knowledge $L \in \mathbb{R}^{d \times k}$ that is shared among all tasks
(including between task domains). This matrix $L$ represents
a set of latent factors that underlie the set of group-specific
basis matrices $\{B^{(\mathcal{G}^{(1)})}, \ldots, B^{(\mathcal{G}^{(G)})}\}$. We introduce a set of
group projection matrices $\Psi^{(\mathcal{G}^{(1)})}, \ldots, \Psi^{(\mathcal{G}^{(G)})}$ that map
the shared latent factors $L$ into the basis for each group of
tasks. Therefore, we have that $B^{(\mathcal{G}^{(g)})} = \Psi^{(\mathcal{G}^{(g)})} L$, where the
group projection matrix $\Psi^{(\mathcal{G}^{(g)})} \in \mathbb{R}^{d^{(t)} \times d}$. With this
construction, we see that each mapping $\Psi^{(\mathcal{G}^{(g)})}$ creates an
intermediate knowledge space (i.e., $B^{(\mathcal{G}^{(g)})}$) that tailors the shared
repository $L$ into a basis that is suitable for learning tasks
from group $\mathcal{G}^{(g)}$. These group-specific bases are coupled
together via the $\Psi$ mappings and the shared knowledge base $L$,
facilitating transfer across task domains with different feature
spaces. The group mappings $\Psi^{(\mathcal{G}^{(g)})}$ also serve to help avoid
overfitting and ensure compactness of the basis, while
maximizing transfer both between tasks within a group and across
task groups. This construction is depicted in Figure 1.
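The factorization described here (a shared repository, a projection per task group, and sparse coefficients per task) can be sketched in a few lines of NumPy. This is only an illustration of the shapes involved: the dimensions, the group names, and the helper `policy_parameters` are assumptions made for the example and are not taken from the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): shared latent dimension d, k latent components.
d, k = 6, 3
# Hypothetical per-group policy-parameter dimensions d^(t).
group_dims = {"cart_pole": 4, "helicopter": 5}

# Shared knowledge repository L in R^{d x k}.
L = rng.normal(size=(d, k))
# One projection matrix per task group, each in R^{d^(t) x d}.
Psi = {g: rng.normal(size=(dim, d)) for g, dim in group_dims.items()}


def policy_parameters(group, s):
    """Return theta = Psi^(g) L s for a task in `group` with coefficients s in R^k."""
    B = Psi[group] @ L  # group-specific basis, shape (d^(t), k)
    return B @ s


# The same sparse coefficient vector yields parameters of different sizes per group.
s = np.array([0.0, 1.5, 0.0])
theta_cp = policy_parameters("cart_pole", s)
theta_hc = policy_parameters("helicopter", s)
```

Because both groups read from the same `L`, any update to `L` changes the basis of every group, which is the coupling that allows knowledge to flow between domains with different parameter dimensions.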

5.1  The  Cross-Domain  MTL  Objective 

Under this shared knowledge framework, given a task $t \in \mathcal{G}^{(g)}$,
its policy parameters are $\theta^{(t)} = \Psi^{(\mathcal{G}^{(g)})} L s^{(t)}$, where
$\Psi^{(\mathcal{G}^{(g)})} \in \mathbb{R}^{d^{(t)} \times d}$, $L \in \mathbb{R}^{d \times k}$, $s^{(t)} \in \mathbb{R}^{k}$, and $k$ is the
number of shared latent knowledge components. Therefore, to
train optimal policies for all tasks, we must learn the shared
knowledge base $L$, the group projections (the $\Psi$'s), and the
task-specific coefficients (the $s^{(t)}$'s). We first examine this
problem from a batch MTL standpoint, and then develop an
efficient online algorithm for optimizing this MTL objective
and enabling lifelong learning in the next section.

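Given task groups $\mathcal{G}^{(1)}, \ldots, \mathcal{G}^{(G)}$, we can represent our
objective of learning $T = \sum_{g=1}^{G} |\mathcal{G}^{(g)}|$ stationary policies while
maximizing the amount of transfer between task policies as
the problem of minimizing:

$$e_T\big(L, \Psi^{(\mathcal{G}^{(1)})}, \ldots, \Psi^{(\mathcal{G}^{(G)})}\big) = \sum_{g=1}^{G} \Bigg\{ \frac{1}{|\mathcal{G}^{(g)}|} \sum_{t \in \mathcal{G}^{(g)}} \min_{s^{(t)}} \Big[ -\mathcal{J}\big(\theta^{(t)}\big) + \mu_1 \big\|s^{(t)}\big\|_1 \Big] + \mu_2 \big\|\Psi^{(\mathcal{G}^{(g)})}\big\|_F^2 \Bigg\} + \mu_3 \|L\|_F^2 \qquad (2)$$

where $\theta^{(t)} = \Psi^{(\mathcal{G}^{(g)})} L s^{(t)}$ and the $L_1$ norm of $s^{(t)}$ is used to
approximate the true vector sparsity. We employ regularization
via the Frobenius norm $\|\cdot\|_F$ to avoid overfitting on both
the shared knowledge base $L$ and each of the group projections
$\Psi^{(\mathcal{G}^{(1)})}, \ldots, \Psi^{(\mathcal{G}^{(G)})}$. Note that this objective is closely
related to PG-ELLA [Bou Ammar et al., 2014], with the critical
difference that it incorporates cross-domain transfer.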

5.2  Online Solution to Cross-Domain MTL

Although Equation 2 allows for batch cross-domain MTL,
the dependence on all available trajectories from all
tasks (via $\mathcal{J}(\theta^{(t)}) = \int_{\tau \in \mathbb{T}^{(t)}} p_{\theta^{(t)}}(\tau)\, \mathfrak{R}^{(t)}(\tau)\, d\tau$) makes the
batch approach unsuitable for learning tasks consecutively,
since the learner requires all trajectories for acquiring a
successful behavior. Here, we derive an approximate optimization
algorithm that is more suitable for lifelong learning.

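Standardizing the Objective

To derive the approximate optimization problem, we note that
policy gradient methods maximize the lower bound of $\mathcal{J}(\theta)$.
In order to use Equation 2 for lifelong cross-domain transfer
with policy gradients, we must first incorporate this lower
bound into our objective function. Rewriting the error term
in Equation 2 in terms of the lower bound yields

$$e_T\big(L, \Psi^{(\mathcal{G}^{(1)})}, \ldots, \Psi^{(\mathcal{G}^{(G)})}\big) = \sum_{g=1}^{G} \Bigg\{ \frac{1}{|\mathcal{G}^{(g)}|} \sum_{t \in \mathcal{G}^{(g)}} \min_{s^{(t)}} \Big[ -\mathcal{J}_{L,\theta}\big(\tilde{\theta}^{(t)}\big) + \mu_1 \big\|s^{(t)}\big\|_1 \Big] + \mu_2 \big\|\Psi^{(\mathcal{G}^{(g)})}\big\|_F^2 \Bigg\} + \mu_3 \|L\|_F^2 \qquad (3)$$

with $\tilde{\theta}^{(t)} = \Psi^{(\mathcal{G}^{(g)})} L s^{(t)}$. However, we can note that

$$\mathcal{J}_{L,\theta}\big(\tilde{\theta}^{(t)}\big) \propto -\int_{\tau \in \mathbb{T}^{(t)}} p_{\theta^{(t)}}(\tau)\, \mathfrak{R}^{(t)}(\tau) \log \frac{p_{\theta^{(t)}}(\tau)\, \mathfrak{R}^{(t)}(\tau)}{p_{\tilde{\theta}^{(t)}}(\tau)}\, d\tau .$$

Therefore, maximizing the lower bound of $\mathcal{J}_{L,\theta}(\tilde{\theta}^{(t)})$ is
equivalent to the following minimization problem:

$$\min_{\tilde{\theta}^{(t)}} \int_{\tau \in \mathbb{T}^{(t)}} p_{\theta^{(t)}}(\tau)\, \mathfrak{R}^{(t)}(\tau) \log \frac{p_{\theta^{(t)}}(\tau)\, \mathfrak{R}^{(t)}(\tau)}{p_{\tilde{\theta}^{(t)}}(\tau)}\, d\tau , \qquad (4)$$

which can be plugged into Equation 3 in place of $\mathcal{J}_{L,\theta}(\tilde{\theta}^{(t)})$
to obtain the MTL policy gradients objective.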

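Approximate Learning Objective

To eliminate the dependence of the objective function on all
available trajectories, we can approximate $e_T(\cdot)$ by performing
a second-order Taylor expansion of $\mathcal{J}_{L,\theta}(\tilde{\theta}^{(t)})$ around
the optimal solution $\alpha^{(t)}$ to Equation 4, following the
technique used in PG-ELLA [Bou Ammar et al., 2014]. Note
that attaining $\alpha^{(t)}$ corresponds to solving the policy gradient
problem of task $t \in \mathcal{G}^{(g)}$, which might be computationally
expensive. Therefore, rather than using the above, we use
an approximation acquired by performing a gradient step in
task $t$: $\alpha^{(t)} = \theta + \eta\, \mathcal{I}^{-1} \nabla_{\tilde{\theta}^{(t)}} \mathcal{J}_{L,\theta}(\tilde{\theta}^{(t)})$, where $\mathcal{I}$ is the
Fisher information matrix. The first derivative, needed for the
second-order Taylor expansion, is given by:

$$\nabla_{\tilde{\theta}^{(t)}} \mathcal{J}_{L,\theta}\big(\tilde{\theta}^{(t)}\big) = -\int_{\tau \in \mathbb{T}^{(t)}} p_{\theta^{(t)}}(\tau)\, \mathfrak{R}^{(t)}(\tau)\, \nabla_{\tilde{\theta}^{(t)}} \log p_{\tilde{\theta}^{(t)}}(\tau)\, d\tau ,$$

with

$$\log p_{\tilde{\theta}^{(t)}}(\tau) = \log p_0^{(t)}\big(x_0^{(t)}\big) + \sum_{m=1}^{M^{(t)}} \log p^{(t)}\big(x_{m+1}^{(t)} \mid x_m^{(t)}, a_m^{(t)}\big) + \sum_{m=1}^{M^{(t)}} \log \pi_{\tilde{\theta}^{(t)}}\big(a_m^{(t)} \mid x_m^{(t)}\big) .$$

Therefore, we have that

$$\nabla_{\tilde{\theta}^{(t)}} \mathcal{J}_{L,\theta}\big(\tilde{\theta}^{(t)}\big) = -\int_{\tau \in \mathbb{T}^{(t)}} p_{\theta^{(t)}}(\tau)\, \mathfrak{R}^{(t)}(\tau) \sum_{m=1}^{M^{(t)}} \nabla_{\tilde{\theta}^{(t)}} \log \pi_{\tilde{\theta}^{(t)}}\big(a_m^{(t)} \mid x_m^{(t)}\big)\, d\tau = -\mathbb{E}\Bigg[ \mathfrak{R}^{(t)}(\tau) \sum_{m=1}^{M^{(t)}} \nabla_{\tilde{\theta}^{(t)}} \log \pi_{\tilde{\theta}^{(t)}}\big(a_m^{(t)} \mid x_m^{(t)}\big) \Bigg] .$$

The second derivative of $\mathcal{J}_{L,\theta}$ is then:

$$\Gamma^{(t)} = -\mathbb{E}\Bigg[ \mathfrak{R}^{(t)}(\tau) \sum_{m=1}^{M^{(t)}} \nabla^2_{\tilde{\theta}^{(t)}, \tilde{\theta}^{(t)}} \log \pi_{\tilde{\theta}^{(t)}}\big(a_m^{(t)} \mid x_m^{(t)}\big) \Bigg]_{\tilde{\theta}^{(t)} = \alpha^{(t)}} .$$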

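Substituting the second-order Taylor expansion yields the
following approximation of Equation 2:

$$\hat{e}_T\big(L, \Psi^{(\mathcal{G}^{(1)})}, \ldots, \Psi^{(\mathcal{G}^{(G)})}\big) = \sum_{g=1}^{G} \Bigg\{ \frac{1}{|\mathcal{G}^{(g)}|} \sum_{t \in \mathcal{G}^{(g)}} \min_{s^{(t)}} \Big[ \big\|\alpha^{(t)} - \Psi^{(\mathcal{G}^{(g)})} L s^{(t)}\big\|^2_{\Gamma^{(t)}} + \mu_1 \big\|s^{(t)}\big\|_1 \Big] + \mu_2 \big\|\Psi^{(\mathcal{G}^{(g)})}\big\|_F^2 \Bigg\} + \mu_3 \|L\|_F^2 \qquad (5)$$

where $\|v\|^2_A = v^{\mathsf{T}} A v$, the constant term was suppressed as
it has no effect on the minimization, and the linear term was
ignored by construction since $\alpha^{(t)}$ is a minimizer. Critically,
we have eliminated the dependence on all available trajectories,
making the problem suitable for an online MTL setting.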
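To make the structure of this approximate objective concrete, the sketch below evaluates it for given quantities: a weighted squared residual between each task's single-task solution and its factored reconstruction, an L1 penalty on the coefficients, and Frobenius penalties on the projections and the shared repository. It is a minimal illustration under stated assumptions: the coefficients are held fixed rather than minimized over, and the function names and the `groups` data layout are hypothetical, not the authors' code.

```python
import numpy as np


def weighted_sq_norm(v, A):
    """Return ||v||_A^2 = v^T A v."""
    return float(v @ A @ v)


def task_loss(alpha, Gamma, Psi, L, s, mu1):
    """Per-task term for fixed coefficients s: ||alpha - Psi L s||_Gamma^2 + mu1 ||s||_1."""
    residual = alpha - Psi @ L @ s
    return weighted_sq_norm(residual, Gamma) + mu1 * np.abs(s).sum()


def approx_objective(groups, L, mu1, mu2, mu3):
    """Evaluate the approximate multi-group objective for fixed coefficients.

    `groups` is a list of dicts (hypothetical layout) with keys:
      'Psi'   : the group's projection matrix
      'tasks' : list of (alpha, Gamma, s) tuples, one per task in the group
    """
    total = mu3 * np.sum(L ** 2)  # Frobenius penalty on the shared repository
    for g in groups:
        losses = [task_loss(a, G, g["Psi"], L, s, mu1) for (a, G, s) in g["tasks"]]
        total += np.mean(losses) + mu2 * np.sum(g["Psi"] ** 2)
    return float(total)
```

Note that the objective only needs each task's pair (alpha, Gamma) rather than its trajectories, which is what makes an online treatment possible.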
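Learning the Policy

We fit the policy parameters in two steps. Upon observing a
task $t \in \mathcal{G}^{(g)}$, we use gradient descent to update the shared
repository $L$ and the group projections $\Psi^{(\mathcal{G}^{(g)})}$. The update
rules, acquired by taking the derivative of Equation 5 with
respect to $L$ and $\Psi^{(\mathcal{G}^{(g)})}$ (constant factors are absorbed into
the learning rates), can be written as:

$$\Delta L = \eta_L \Bigg[ \sum_{g=1}^{G} \frac{1}{|\mathcal{G}^{(g)}|} \sum_{t \in \mathcal{G}^{(g)}} \Big( -\Psi^{(\mathcal{G}^{(g)})\mathsf{T}} \Gamma^{(t)} \alpha^{(t)} s^{(t)\mathsf{T}} + \Psi^{(\mathcal{G}^{(g)})\mathsf{T}} \Gamma^{(t)} \Psi^{(\mathcal{G}^{(g)})} L s^{(t)} s^{(t)\mathsf{T}} \Big) + \mu_3 L \Bigg] \qquad (6)$$

$$\Delta \Psi^{(\mathcal{G}^{(g)})} = \eta_{\Psi^{(\mathcal{G}^{(g)})}} \Bigg[ \frac{1}{|\mathcal{G}^{(g)}|} \sum_{t \in \mathcal{G}^{(g)}} \Big( -\Gamma^{(t)} \alpha^{(t)} \big(L s^{(t)}\big)^{\mathsf{T}} + \Gamma^{(t)} \Psi^{(\mathcal{G}^{(g)})} L s^{(t)} \big(L s^{(t)}\big)^{\mathsf{T}} \Big) + \mu_2 \Psi^{(\mathcal{G}^{(g)})} \Bigg] \qquad (7)$$

where $\eta_L$ and $\eta_{\Psi^{(\mathcal{G}^{(g)})}}$ are the learning rates for the shared
repository $L$ and group projections $\Psi^{(\mathcal{G}^{(g)})}$, respectively.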


Having learned the shared representation $L$ and group
projections, task-specific coefficients $s^{(t)}$ are then determined
by solving an instance of Lasso [Tibshirani, 1996] to encode
$\alpha^{(t)}$ in the basis given by $\Psi^{(\mathcal{G}^{(g)})} L$, leading to a task-specific
policy $\pi^{(t)} = p_{\theta^{(t)}}(a \mid x)$ where $\theta^{(t)} = \Psi^{(\mathcal{G}^{(g)})} L s^{(t)}$. For
computationally demanding domains, we can update $s^{(t)}$ less
frequently, instead allowing the policy to change by altering
$L$ and the $\Psi$'s. The complete implementation of our approach
is available on the authors' websites.
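The two-step fit described above (a gradient update of the shared repository and the group projection, followed by a Lasso problem for the task coefficients) can be sketched as follows for a single observed task. This is a hedged illustration, not the authors' released implementation: the Lasso step is solved with plain iterative soft-thresholding (ISTA), a solver chosen here only to keep the example self-contained; the curvature matrix `Gamma` is assumed symmetric positive semidefinite; and all function names are hypothetical. The gradients are written with their exact factor of 2 and are subtracted in a descent step.

```python
import numpy as np


def soft_threshold(x, tau):
    """Elementwise soft-thresholding operator used by ISTA."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)


def lasso_ista(alpha, Gamma, B, mu1, n_iter=500):
    """Approximately minimize ||alpha - B s||_Gamma^2 + mu1 ||s||_1 over s.

    Assumes Gamma is symmetric positive semidefinite.
    """
    H = B.T @ Gamma @ B
    lip = 2.0 * max(np.linalg.eigvalsh(H).max(), 1e-12)  # Lipschitz constant of the smooth part
    s = np.zeros(B.shape[1])
    for _ in range(n_iter):
        grad = -2.0 * B.T @ Gamma @ (alpha - B @ s)
        s = soft_threshold(s - grad / lip, mu1 / lip)
    return s


def gradients(alpha, Gamma, Psi, L, s, mu2, mu3):
    """Gradients of ||alpha - Psi L s||_Gamma^2 + mu2 ||Psi||_F^2 + mu3 ||L||_F^2 (s held fixed)."""
    r = alpha - Psi @ L @ s
    grad_L = -2.0 * Psi.T @ Gamma @ np.outer(r, s) + 2.0 * mu3 * L
    grad_Psi = -2.0 * Gamma @ np.outer(r, L @ s) + 2.0 * mu2 * Psi
    return grad_L, grad_Psi


def online_step(alpha, Gamma, Psi, L, s, eta_L, eta_Psi, mu1, mu2, mu3):
    """One update for an observed task: descend on L and Psi, then re-encode alpha with Lasso."""
    grad_L, grad_Psi = gradients(alpha, Gamma, Psi, L, s, mu2, mu3)
    L_new = L - eta_L * grad_L
    Psi_new = Psi - eta_Psi * grad_Psi
    s_new = lasso_ista(alpha, Gamma, Psi_new @ L_new, mu1)
    return L_new, Psi_new, s_new
```

Updating the coefficients less often, as suggested for computationally demanding domains, would amount to skipping the `lasso_ista` call on some steps and reusing the previous `s`.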


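6  Theoretical Guarantees

This section shows that our approach becomes stable as the
number of tasks and groups grows large. Detailed proofs and
definitions are provided in an online appendix available on the
authors' websites. First, we consider the one-group setting by
defining the following expected loss:¹

$$h\big(\Psi^{(\mathcal{G}^{(g)})}\big) = \mathbb{E}_{(\alpha^{(t)}, \Gamma^{(t)})} \Big[ \min_{s} \ell\big(L, \Psi^{(\mathcal{G}^{(g)})}, s, \alpha^{(t)}, \Gamma^{(t)}\big) \,\Big|\, \mathcal{G}^{(g)} \Big] \qquad (8)$$

where the expectation is over each task $t$ in group $\mathcal{G}^{(g)}$
according to the task's parameters $(\alpha^{(t)}, \Gamma^{(t)})$, and $\ell(\cdot)$ is the
per-task loss of encoding $\alpha^{(t)}$ in the basis given by $\Psi^{(\mathcal{G}^{(g)})} L$.

¹ We super- or subscript variables with $(|\mathcal{G}^{(g)}|)$ to denote the
version of the variable learned from tasks in $\mathcal{G}^{(g)}$.

Proposition 1.
$$\Psi^{(\mathcal{G}^{(g)})}_{|\mathcal{G}^{(g)}|} - \Psi^{(\mathcal{G}^{(g)})}_{|\mathcal{G}^{(g)}|-1} = O\!\left(\frac{1}{|\mathcal{G}^{(g)}|}\right)$$

Proof. Here, we sketch the proof of Prop. 1. With
$\ell(L, \Psi^{(\mathcal{G}^{(g)})}, s, \alpha^{(t)}, \Gamma^{(t)})$, we can show that
$\hat{h}_{|\mathcal{G}^{(g)}|}(\Psi) - \hat{h}_{|\mathcal{G}^{(g)}|-1}(\Psi)$ is Lipschitz
with $O\big(1/|\mathcal{G}^{(g)}|\big)$. Further, given enough gradient steps
for $\Psi^{(\mathcal{G}^{(g)})}_{|\mathcal{G}^{(g)}|-1}$ to minimize $\hat{h}_{|\mathcal{G}^{(g)}|-1}(\Psi)$,
it is clear that the change in the loss can be upper bounded by
$2\lambda \big\| \Psi^{(\mathcal{G}^{(g)})}_{|\mathcal{G}^{(g)}|} - \Psi^{(\mathcal{G}^{(g)})}_{|\mathcal{G}^{(g)}|-1} \big\|_F^2$. Combining the
Lipschitz bound with previous facts concludes the proof. □

Proposition 2. With $h(\cdot)$ as the actual loss in $\mathcal{G}^{(g)}$, we show:

1. $\hat{h}_{|\mathcal{G}^{(g)}|}\big(\Psi^{(\mathcal{G}^{(g)})}\big)$ converges a.s.

2. $h_{|\mathcal{G}^{(g)}|}\big(\Psi^{(\mathcal{G}^{(g)})}\big) - \hat{h}_{|\mathcal{G}^{(g)}|}\big(\Psi^{(\mathcal{G}^{(g)})}\big)$ converges a.s. to 0

3. $h_{|\mathcal{G}^{(g)}|}\big(\Psi^{(\mathcal{G}^{(g)})}\big) - h\big(\Psi^{(\mathcal{G}^{(g)})}\big)$ converges a.s. to 0

4. $h\big(\Psi^{(\mathcal{G}^{(g)})}\big)$ converges a.s.

Proof. First, we show that the sum of positive
variations of the stochastic process $\hat{h}_{|\mathcal{G}^{(g)}|}\big(\Psi^{(\mathcal{G}^{(g)})}_{|\mathcal{G}^{(g)}|}\big)$
is bounded by invoking
a corollary of the Donsker theorem [Van der Vaart, 2000].
This result, in combination with a theorem from Fisk [1965],
allows us to show that this process is a quasi-martingale that
converges almost surely (a.s.). This fact along with a simple
theorem of positive sequences allows us to prove part 2 of the
proposition. The final two parts (3 & 4) can be shown due to
the equivalence of $h$ and $h_{|\mathcal{G}^{(g)}|}$ as $|\mathcal{G}^{(g)}| \to \infty$. □

Proposition 3. The distance between $\Psi^{(\mathcal{G}^{(g)})}_{|\mathcal{G}^{(g)}|}$ and the
set of $h$'s stationary points converges a.s. to 0 as $|\mathcal{G}^{(g)}| \to \infty$.

Proof. Both the surrogate $\hat{h}_{|\mathcal{G}^{(g)}|}$ and the expected cost $h$
have gradients that are Lipschitz with constant independent
of $|\mathcal{G}^{(g)}|$. This fact, in combination with the fact that $\hat{h}_{|\mathcal{G}^{(g)}|}$
and $h$ converge a.s. as $|\mathcal{G}^{(g)}| \to \infty$, completes the proof. □

Next, we consider the loss of multiple groups:

$$g(L) = \mathbb{E}_{\mathcal{G}^{(g)}} \Big[ h\big(\Psi^{(\mathcal{G}^{(g)})}\big) \,\Big|\, \mathcal{G}^{(g)} \Big] . \qquad (9)$$

Proposition 4.
$$L_{G} - L_{G-1} = O\!\left( \sum_{g=1}^{G} \frac{1}{|\mathcal{G}^{(g)}|} \right)$$

Proof. This can be easily shown as the upper bound of
$\hat{g}(\cdot)$ is the sum of the bounds over
$\hat{h}\big(\Psi^{(\mathcal{G}^{(g)})} \mid \mathcal{G}^{(g)}\big)$. □

Proposition 5. With $g(\cdot)$ as the actual loss, we show:

1. $\hat{g}_{G}(L)$ converges a.s.

2. $g_{G}(L) - \hat{g}_{G}(L)$ converges a.s. to 0

3. $g_{G}(L) - g(L)$ converges a.s. to 0

4. $g(L)$ converges a.s.

Proof. This can be attained similarly to that of Prop. 2. □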
Figure 2: Learning performance after interleaved training over multiple task domains. Figures (a) and (b) depict task domains
where  cross-domain  transfer  has  a  significant  impact,  showing  that  our  approach  outperforms  standard  PG  and  PG-ELLA. 
Figure  (c)  demonstrates  that  even  when  a  domain  benefits  less  from  cross-domain  transfer,  our  approach  still  achieves  equivalent 
performance  to  PG-ELLA.  Figure  (d)  depicts  the  average  improvement  in  initial  task  performance  over  PG  from  transfer. 


7  Empirical  Results 

We evaluated the ability of our approach to learn optimal control
policies for multiple consecutive tasks from six different
dynamical systems (each corresponding to one task domain):

Simple Mass (SM): The spring-mass-damper system is
characterized by three parameters: spring and damping constants
and the mass. The system's state is given by the position
and velocity of the mass, which varies according to linear
force. The goal is to control the mass to be in a specific state.

Double Mass (DM): The double spring-mass-damper has
six parameters: two spring constants, damping constants, and
masses. The state is given by the position and velocity of both
masses. The goal is to control the first mass to a specific state,
while only applying a linear force to the second.

Cart-Pole (CP): The dynamics of the inverted pendulum
system are characterized by the cart's and pole's masses, the
pole's length, and a damping parameter. The state is characterized
by the cart's position and velocity, and the pole's angle
and angular velocity. The goal is to balance the pole upright.

Double Cart-Pole (DCP): The DCP adds a second inverted
pendulum to the CP system, with six parameters and
six state features. The goal is to balance both poles upright.

Bicycle (Bike): The Bike model assumes a fixed rider, and
is characterized by eight parameters. The goal is to keep the
bike balanced as it rolls along the horizontal plane.

Helicopter (HC): This linearized model of a CH-47
tandem-rotor helicopter assumes horizontal motion at 40
knots. The main goal is to stabilize the helicopter by controlling
the collective and differential rotor thrust.

For each of these systems, we created three different tasks
by varying the system parameters to create systems with
different dynamics, yielding 18 tasks total. These tasks
used a reward function typical for optimal control, given by
$-\sqrt{(x_m - \hat{x})^{\mathsf{T}} (x_m - \hat{x})} - \sqrt{a_m^{\mathsf{T}} a_m}$, where $\hat{x}$ is the goal state.
Each round of the lifelong learning experiment, one task $t$
was chosen randomly with replacement, and task $t$'s model
was trained from 100 sampled trajectories of length 50. This
process continued until all tasks were seen at least once.
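The reward and the task-sampling protocol just described can be illustrated with a short sketch. It assumes the reward is the negative Euclidean distance to the goal state minus the Euclidean norm of the action, and that rounds draw one task uniformly with replacement until every task has appeared. The function names, the seed, and the task labels are illustrative choices, not part of the original experimental code.

```python
import math
import random


def reward(x, a, goal):
    """Negative distance to the goal state minus the norm of the action."""
    state_cost = math.sqrt(sum((xi - gi) ** 2 for xi, gi in zip(x, goal)))
    action_cost = math.sqrt(sum(ai ** 2 for ai in a))
    return -state_cost - action_cost


def sample_task_order(tasks, seed=0):
    """Draw tasks uniformly with replacement until each task has been seen at least once."""
    rng = random.Random(seed)
    unseen = set(tasks)
    order = []
    while unseen:
        t = rng.choice(tasks)
        order.append(t)
        unseen.discard(t)
    return order


# Six domains with three tasks each, giving 18 tasks (labels are illustrative).
tasks = [(domain, i) for domain in ("SM", "DM", "CP", "DCP", "Bike", "HC") for i in range(3)]
order = sample_task_order(tasks)
```

Because sampling is with replacement, the number of rounds needed to cover all 18 tasks varies from run to run and is typically well above 18, so many tasks are revisited.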

We then compared the performance of cross-domain lifelong
learning with PG-ELLA and standard policy gradients
(PG), averaging results over approximately 94 trials per domain
(each of which contained approximately 60 interleaved training
rounds). As the base PG learner in all algorithms, we used
Natural Actor Critic [Peters & Schaal, 2008]. All regularization
parameters (the $\mu$'s) were set to $e^{-5}$, and the learning rates and latent
dimensions were set via cross-validation over a few tasks.

Figure 3: Average learning performance on a novel task domain
(helicopter) after lifelong learning on other domains.

Figure 2 shows the average learning performance on individual
domains after this process of interleaved lifelong learning,
depicting domains in which cross-domain transfer shows
clear advantages over PG-ELLA and PG (e.g., DCP, HC), and
an example domain where cross-domain transfer is less
effective (CP). Note that even in a domain where cross-domain
transfer provides little benefit, our approach achieves
equivalent performance to PG-ELLA, showing that cross-domain
transfer does not interfere with learning effectiveness. On all
task domains except HC, our approach provides a significant
increase in initial performance due to transfer (Figure 2(d)).

Cross-domain transfer provides significant advantages
when the lifelong learning agent faces a novel task domain.
To evaluate this, we chose the most complex of the task
domains (helicopter) and trained the lifelong learner on tasks
from all other task domains to yield an effective shared
knowledge base $L$. Then, we evaluated the agent's ability to
learn a new task from the helicopter domain, comparing the
benefits of cross-domain transfer from $L$ with PG-ELLA and
PG (both of which learn from scratch on the new domain).
Figure 3 depicts the result of learning on a novel domain,
averaged over ten trials for all three HC tasks, showing the
effectiveness of cross-domain lifelong learning in this scenario.

8  Conclusion 

We have presented the first lifelong RL method that supports
autonomous and efficient cross-domain transfer. This approach
provides a variety of theoretical guarantees, and can
learn effectively across multiple task domains, providing
improved performance over single-domain methods.
Abstract

Policy advice is a transfer learning method where a
student agent is able to learn faster via advice from
a teacher. However, both this and other reinforcement
learning transfer methods have little theoretical
analysis. This paper formally defines a setting
where multiple teacher agents can provide advice
to a student and introduces an algorithm to leverage
both autonomous exploration and teacher's advice.
Our regret bounds justify the intuition that
good teachers help while bad teachers hurt. Using
our formalization, we are also able to quantify,
for the first time, when negative transfer can occur
within such a reinforcement learning setting.


1  Introduction 

Reinforcement Learning (RL) has become a popular framework
for autonomous behavior generation from limited feedback
[Sutton and Barto, 1998]. Typical RL methods learn in
isolation, increasing their learning times and sample
complexities. Transfer learning aims to significantly improve
learning by providing informative knowledge from an external
source. The source of such knowledge varies from source
agents to humans providing advice [Erez and Smart, 2008;
Taylor and Stone, 2009]. In this paper, we focus on a framework
referred to as action advice or the advice model [Taylor
et al., 2014]. Here, the agent (i.e., student), learning in a
task, has access to a teacher (another agent or human) which
can provide action suggestions to facilitate learning. Given
"good-enough" teachers, such advice models have shown
multiple benefits over standard RL techniques. For example,
others [Taylor et al., 2014; Zimmer et al., 2014] show reduced
learning times and sample complexities for successful
behavior.

These methods, however, suffer from two main drawbacks.
First, validation results are empirical in nature and not
formally grounded. We do not have fundamental understanding
of these methods. Consequently, it is difficult to formally
comprehend why these methods work. Second, most of these
techniques require the availability of a "good-enough"
(optimal) teacher to benefit the student. Unfortunately, access
to such teachers is difficult in a variety of complex domains,
reducing the applicability of policy advice in real-world
settings.

In this paper, we remedy the aforementioned drawbacks by
proposing a new framework for policy advice. Our method
formally generalizes current single-teacher advice models to
the multi-teacher setting. Our algorithm also remedies the
need for optimal teachers by exploiting both the student's
and the teacher's knowledge. Even if the teacher is not
optimal, a student, using our algorithm, is still capable of
acquiring optimal behavior in a task; a property not supported
by some state-of-the-art methods, e.g., learning from
demonstration. We theoretically and empirically analyze the
performance of the proposed method and derive, for the first time,
regret bounds quantifying the successfulness of action advice.
We also provide theoretical justification for current methods
(i.e., single-teacher models) as a special case of our
formulation¹. Our contributions can be summarized as:

•  defining (formally) multi-teacher advice models,

•  introducing novel algorithms leveraging teacher and student
knowledge,

•  deriving the regret analysis showing reduced sample
complexities,

•  deriving theoretical guarantees for single teacher advice
models, and

•  quantifying negative transfer under such advice models.

Interestingly, these theoretical results justify a well-known
intuition inherent to advice models: "good teachers help
while bad teachers hurt." The results show that students can
still achieve optimal behavior when being advised by bad
teachers. They, however, pay an extra cost in terms of their
learning times or sample complexities, relative to an optimal
teacher. This should inspire researchers to adopt high quality
teacher policies or avoid "bad teachers" if possible.

Given our formalization, we also derive a relation to negative
transfer. We quantify, for the first time, the occurrence
of negative transfer in action advice models, shedding
light on failure modes of these methods. Consequently, these
results yield two claims about transfer learning. First, high
quality transfer knowledge may still cause negative transfer
when the target algorithm is able to outperform the source
knowledge. Second, expert knowledge is important for the
researchers to determine whether or not to transfer because
evaluation of the transfer knowledge is usually expensive (it is
equivalent to evaluating the teacher policy in the target MDP).

¹ The full version: https://arxiv.org/pdf/1604.


Algorithm  1  REGAL.C:  Constrained  Optimization 
Input:  parameter  II ,  dataset  A  and  current  time  T 

Output: 

l:  £i  —  current  lime  T 

2:  Use  to  update  the  state  transition  probabilities  by 


2  Preliminaries 

2, 1  Online  Reinforcement  Learning  &  Regret 
Model 


In  RL„  an  agent  must  sequentially  select  actions  to  maxi¬ 
mize  its  total  expected  return.  Such  problems  are  formal¬ 
ized  as  a  Markov  decision  Process  (MDP),  defined  as  At  — 
(S,  A.  V,  'R),  where  .S  and  A  denote  the  finite  state  and  ac¬ 
tions  spaces  with  a  total  size  of  |5|  and  \A\  respectively, 
V  :  S  x  A  x  S  — *  [fl,  1  represents  the  probability  transition 
kernel  describing  the  task  dynamics,  and  72  :  <S  X  A  — >  R  is 
the  reward  function  quantifying  the  performance  of  the  agent. 
The  total  expected  return  of  an  agent  following  an  algorithm, 
to  compute  the  optimal  action-selection  rule  from  a  start¬ 
ing  stale  &  6  S  after  T  time  steps  is  defined  as: 


n5(s,T)  =  e 


T 

.t-0 


(i) 


with  st  e  S  and  at  €  A.  The  goal  is  to  determine  an  optimal 
policy,  7T*  :  $  — >  A  that  maximizes  the  total  expected  return. 

Regret  Model:  Similar  to  standard  online  learning,  we 
quantify  the  performance  of  the  algorithm,  by  measuring 
its  regret  with  respect  to  optimality.  We  define  the  regret  of  a 
slate  s  after  T  time  steps  in  terms  of  the  expected  reward  as: 


A g(s,T)  =  \*T  -Kg(s,T),  (2) 


where  A'  is  the  optimal  reward  acquired  by  following  an  opti¬ 
mal  algorithm  Q*  at  each  time  step.  In  the  general  case  when 
no  reachability  assumptions  arc  imposed,  it  is  easy  to  con¬ 
struct  MDPs  in  which  algorithms  suffer  high  regret.  Hollow¬ 
ing  Putcrman  !2005]T  we  remedy  this  problem  by  considering 
weakly-communicating  MDPs2  defined  as  follows. 

Definition  I.  At i  MDP  is  called  weakly  communicating  in 
such  a  case  where  the  state  set  S  can  be  decomposed  into  two 
subsets,  <5 1  and  A,  In  S\  any  state  is  reachable  front  every 
other  state  under  a  deterministic  policy,  it,  while  states  in  S  ? 
are  transient  under  all  policies , 

The  optimal  gain.  A*  in  Equation  2,  is  state  independent. 
That  is,  any  s  €  S,  shares  the  same  optimal  expected  re¬ 
ward  [Puterman,  20051,  which  can  be  solved  Tor  using: 

fr*  4-  ATe  =  max{7£(s,a)  + 


where  h*  is  an  |2>|  dimensional  vector  typically  referred  to 
as  the  bias  vector,  Va,a  denotes  the  probability  to  transition 
from  s  applying  action  a,  and  e.  €  is  a  unit  dimensional 
vector.  When  needed,  we  explicitly  write  the  dependency  of 
A*  and  h*  as  h*(n\  A1)  and  A*(.M).  We  also  define  the  span 
of  h  as:  sp(/i)  =  max*  1=5  /i(s)  —  min*^  fa(s). 

1  Please  note  that  weakly-communicating  MDPs  arc  considered 
ilie  most  general  among  subclasses  of  MDPs,  see  Puterman  120051. 


KM)  = : 


N{s,  a *  s':  t) 


max{i\T(£,  a;  £)j  1} 
3:  With  £  =  ti ,  M i  is  the  set  of  MDPs  s.t. 


(3) 


*-alli  -  / 


12|g|log(2|.4|t/g) 
max{Ar(.s,  a,  s;i),  1} 


4:  Select  Ml  €  Ml  by  following  optimization  equation 
over  V M  €  A A 

maxA+(A/)  s.t.  sp(Jt*(A4))  <  // 


5:  irj_|_j=avetage  reward  optimal  policy  for  M%  (value  itera¬ 
tion) 

6:  return  ?fr+i 


Finally,  we  follow  Bartlett  and  Tewari  [20091  to  define 
reachability  in  weakly  communicating  MDPs  using  the  one¬ 
way  diameter:  diamonc.w^Af)  —  max#£5  min,  TZ 
with  being  the  expected  number  of  steps  needed  for 

reaching  s  =  argmax,^  Ad)  from  s\  €  <5. 

2.2  Algorithms  for  Weakly1 -Communicating  MDPs 

REGAL.C is an online algorithm for weakly communicating MDPs developed by Bartlett and Tewari [2009]. The basic idea is that REGAL.C can estimate the true MDP with high probability in order to learn an $\epsilon$-optimal policy with high probability. Let $N(s, a, s'; t)$ be the number of times the state-action-state triple $(s, a, s')$ has been visited up to time $t$. Further, let $t_i$ denote the initial time of iteration $i$. For brevity, we use $N_i(s, a, s')$ and $N_i(s, a)$ to denote $N(s, a, s'; t_i)$ and $N(s, a; t_i)$ at iteration $i$. We also use $v_i(s, a) = N_{i+1}(s, a) - N_i(s, a)$ to denote the number of times a state-action pair $(s, a)$ is visited during iteration $i$. For each iteration $i$, REGAL.C acquires a dataset $D_i$ as input and updates the transition probability (see Equation 3). It then constructs a set of MDPs $\mathcal{M}_i$ to select from using $\max \lambda^*(M)$ s.t. $sp(h^*(M)) \le H$, where $H$ is the upper bound on the span $sp(h^*(M))$. Given the MDP, REGAL.C uses value iteration for acquiring the optimal policy. These steps are summarized in Algorithm 1.
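The counting-based estimate and the confidence set that REGAL.C builds around it can be illustrated with a minimal Python sketch. The function names, the nested-list layout of the counts, and the example numbers are assumptions made for illustration; they are not taken from the original implementation.

```python
import math

def estimate_transitions(counts):
    """Empirical transition estimate in the spirit of Equation 3.

    counts[s][a][s2] holds the number of observed (s, a, s2) triples
    (a hypothetical data layout chosen for this sketch).
    """
    p_hat = []
    for per_state in counts:
        row = []
        for per_action in per_state:
            n_sa = max(sum(per_action), 1)  # max{N(s, a), 1} avoids division by zero
            row.append([n / n_sa for n in per_action])
        p_hat.append(row)
    return p_hat

def confidence_radius(n_sa, num_states, num_actions, t, delta):
    """L1 radius of the confidence set around the estimate for one (s, a) pair."""
    return math.sqrt(
        12 * num_states * math.log(2 * num_actions * t / delta) / max(n_sa, 1)
    )

# Two states, two actions; the pair (s=0, a=1) has never been visited.
counts = [[[3, 1], [0, 0]],
          [[0, 2], [1, 1]]]
p_hat = estimate_transitions(counts)
```

The radius shrinks as a state-action pair is visited more often, so the set of plausible MDPs tightens around the true one as data accumulates.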

2.3  Single  Teacher  Advice  Model 

The single teacher advice model is a framework in which a student learning in an environment benefits from a teacher's advice to speed up learning. We define such a framework as the tuple $\langle \pi^T, b, \Theta, f_A \rangle$. Here, $\pi^T$ denotes the teacher's policy, $b$ represents the budget constraining the teacher's advice, $\Theta$ is the student, and $f_A$ is a function controlling the advice from the teacher to the student. Apart from considering single teacher models, previous work assumed optimal teachers where students always execute recommended actions. It is easy to construct complex settings in which access to optimal teachers is difficult. Consequently, we extend these works to the more realistic settings of sub-optimal teachers, as we detail later.

3  Multiple  Teacher  Advice  Model 

In this section we start by extending the single teacher model of Taylor et al. [2014] to the multiple non-optimal teacher setting. Our advice model for $m$ teachers is defined as the tuple $\langle \Pi, B, \Theta, f_A \rangle$, where $\Pi = \{\pi^{T_1}, \pi^{T_2}, \dots, \pi^{T_m}\}$ is the set of $m \in \mathbb{N}$ teacher policies, and $B = \{b_1, b_2, \dots, b_m\}$ denotes the set of budgets. It is easy to see that in case $\Pi = \{\pi^T\}$ and $B = \{b\}$, we recover the special case of the single teacher model. We also generalize the work of Taylor et al. [2014] by making no restrictive assumptions on the optimality of any of the teachers. We measure the performance of the teacher with respect to a base policy $\pi^B$ in terms of regret:

Definition 2. Given a teacher's policy, $\pi^T \in \Pi$, and a base policy $\pi^B$, then the regret of following $\pi^T$ is related to that acquired by following $\pi^B$ using:

$$\Delta^T(s, T) = \rho \Delta^B(s, T),$$

where $\rho \ge 0$ denotes the regret ratio, $\Delta^T(s, T) = \lambda^* T - \mathcal{R}^{\pi^T}(s, T)$ and $\Delta^B(s, T) = \lambda^* T - \mathcal{R}^{\pi^B}(s, T)$.

The above definition captures the three interesting cases quantifying the performance of an advice-based algorithm. If the teacher is optimal, i.e., when $\Delta^T(s, T) = 0$, $\rho$ is also 0. In case $0 < \rho \le 1$, then $\Delta^T(s, T) \le \Delta^B(s, T)$, indicating that the teacher's policy is at least as good as the base policy $\pi^B$. Finally, when $\rho > 1$, $\Delta^T(s, T) > \Delta^B(s, T)$, implying the underperformance of the teacher. Consequently, with the correct choice of the teacher by $\rho$ one can still achieve successful advice even in such a generalized setting.
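The three cases can be made concrete with a short Python sketch. The function names and the numeric rewards are hypothetical, and the sketch assumes the accumulated rewards of both policies over $T$ steps are already known.

```python
def regret(lambda_star, horizon, total_reward):
    """Regret after `horizon` steps: lambda* T minus the reward actually collected."""
    return lambda_star * horizon - total_reward

def regret_ratio(lambda_star, horizon, reward_teacher, reward_base):
    """Ratio rho between the teacher's regret and the base policy's regret."""
    base = regret(lambda_star, horizon, reward_base)
    if base <= 0:
        # Assumption of this sketch: a base policy with zero regret leaves rho undefined.
        raise ValueError("base policy has no regret; ratio undefined")
    return regret(lambda_star, horizon, reward_teacher) / base

def classify_teacher(rho):
    """Map rho to the three cases discussed after Definition 2."""
    if rho == 0:
        return "optimal"
    if rho <= 1:
        return "at least as good as base"
    return "worse than base"
```

For instance, with an optimal gain of 1.0 over 100 steps, a teacher collecting 75 against a base policy collecting 50 yields a ratio of 0.5, which falls in the second case.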

4  Efficient  Multi-teacher  Advice 

In this section, we propose a new algorithm which combines the advice policy and the MDP information collected so far. This allows for an accurate framework outperforming state-of-the-art techniques for policy advice. On a high level, our algorithm consists of three main steps. First, a combined policy is constructed based on multiple teachers. Second, data depending on both the teachers' advice as well as MDP information is collected. Third, a new policy is computed online. Next, we outline each of the three steps and describe our novel algorithm. Having achieved an accurate advice model, we then rigorously analyze the theoretical aspects of our method and show a decrease in sample complexities compared to current techniques.

4.1 The Grand-Teacher

Our method of policy advice constructs a grand-teacher combining all teacher policies in a meta-policy. To construct the grand-teacher, we use an ensemble method and design two meta-policy variations: online and offline constructions. Next, we detail each of the two variations.

Online Grand-Teacher: In the online construction, whenever the student observes an unvisited state, $s \in \mathcal{S}$, each


Algorithm 2 Offline Construction of the Meta-Teacher
Input: The set of states in the MDP, $\mathcal{S}$.
1: while $\exists s \in \mathcal{S}$ is not visited do
2:   Follow a policy in the MDP
3:   if Current state $s$ is not visited then
4:     Query all teachers for advice and select action $a$ using Majority Vote.
5: return $\pi^{T}$ (the meta-teacher policy)


teacher provides its policy advice of the form $\pi^{T_i}(s)$, for all $i \in \{1, \dots, m\}$, with $m$ being the total number of teachers. The student then selects and stores the majority action from all teachers for that state $s$. As far as budget is concerned, it is easy to see that we only need advice for each state in $\mathcal{S}$, thus $b_1 = \dots = b_m = |\mathcal{S}|$. Though easy to implement and test, the online construction suffers from the potentially unrealistic need for the continuous availability of online teachers.
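A minimal Python sketch of the online construction follows. The class and function names are hypothetical, teachers are modelled as plain callables from state to action, and the tie-breaking rule (the action seen first wins) is an assumption, since the text does not specify one.

```python
from collections import Counter

def majority_action(teacher_policies, state):
    """Return the action recommended by most teachers for `state`.

    Ties go to the action that was voted for first (an assumption of this sketch).
    """
    votes = Counter(policy(state) for policy in teacher_policies)
    return votes.most_common(1)[0][0]

class OnlineGrandTeacher:
    """Queries every teacher once per newly visited state and caches the vote."""

    def __init__(self, teacher_policies):
        self.teacher_policies = teacher_policies
        self.advice = {}  # state -> stored majority action

    def act(self, state):
        if state not in self.advice:  # unvisited state: spend one unit of each budget
            self.advice[state] = majority_action(self.teacher_policies, state)
        return self.advice[state]
```

Because the vote is stored, each teacher is queried at most once per state, which matches the budget of $|\mathcal{S}|$ per teacher stated above.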

Offline Grand-Teacher: To eliminate the need for an online teacher at each visit of a new state, the offline procedure traverses the states in the MDP for constructing the meta-advice policy. The main steps of this construction are summarized in Algorithm 2.

Note that Algorithm 2 is capable of constructing an offline meta-teacher but requires extra exploration in the MDP. We next show that $O\left(|\mathcal{S}| \log \frac{|\mathcal{S}|}{\delta}\right)$ steps are enough to explore each state in the MDP with high probability:

Theorem 1 (Sample Complexity). If Algorithm 2 independently and uniformly explores each state $s \in \mathcal{S}$, then with probability of at least $1 - \delta$, $O\left(|\mathcal{S}| \log \frac{|\mathcal{S}|}{\delta}\right)$ steps are sufficient to visit each state at least one time.
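The order of this bound can be checked numerically with a standard coupon-collector argument: under uniform sampling, a fixed state is missed after $k$ draws with probability $(1 - 1/|\mathcal{S}|)^k$, and a union bound over states gives the failure probability. The Python sketch below is an illustration under that assumption; the function names are hypothetical and the constant factor is the one from the union bound, not one stated in the paper.

```python
import math

def steps_to_cover(num_states, delta):
    """Uniform draws after which all states are seen with probability >= 1 - delta."""
    return math.ceil(num_states * math.log(num_states / delta))

def miss_probability_bound(num_states, steps):
    """Union bound on the probability that at least one state is never drawn."""
    return num_states * (1 - 1 / num_states) ** steps

# Example: the 11 x 11 grid world has 121 cells.
k = steps_to_cover(121, 0.05)
```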

4.2 Multi-Teacher Advice Algorithm

To improve current methods and arrive at a more realistic advice framework, we now introduce our algorithm combining the grand-teacher's policy and information attained by the student from the MDP.

Our algorithm is based on the following intuition. At the beginning of the learning process, a student requires guidance as it typically has little to no information about the task to be solved. As time progresses and the student explores, the MDP can be effectively exploited for successful learning. Unfortunately, such a process is not well modeled using current methods. Here, we remedy this problem by introducing an algorithm which follows the teacher's advice at the very beginning and then switches to a policy computed by an algorithm operating within the MDP. That is, the teacher guides the student at the beginning of the learning process and, as the student gathers more experience, the teacher's influence diminishes over time by switching to a policy computed by REGAL.C. The overall procedure is summarized in Algorithm 3. Note that our algorithm is inspired by DAGGER [Ross et al., 2010] in the sense that policies are updated by collecting data using a mixture of action selection rules (i.e., student and teacher policies). Contrary to DAGGER, however, our method collects all trajectories as opposed to only collecting inconsistent actions, allowing for more accurate and efficient updates.

Algorithm 3
Input: $\pi^T$ = the grand-teacher policy, $\pi_1$ = any policy
Output: $\pi_{m+1}$, the $\epsilon$-optimal policy
1: $T = 0$
2: for $i = 1$ to $m$ do
3:   Let $\pi_{i+1} = \beta_i \pi^T + (1 - \beta_i) \pi_i$
4:   Follow $\pi_{i+1}$ for $T_i$ steps
5:   Get dataset $D_i$
6:   $T = T + T_i$
7:   $\pi_{i+1}$ = REGAL.C($D_i$, $T$) {See Algorithm 1}
8: return $\pi_{m+1}$

To leverage both the teacher's and learned policies, we set a mixed policy of the form $\pi_{i+1} = \beta_i \pi^T + (1 - \beta_i) \pi_i$, for $0 \le \beta_i \le 1$, to guide the student's dataset collection while allowing the teacher to fractionally control the exploration needed to collect data at the next iteration. $\beta_i$ should typically be set so as to decay exponentially over time. This decreases the student's reliance on the teacher and allows it to exploit the knowledge gathered from the MDP to learn better behaving policies than that of the teacher. It is for this reason that our algorithm, contrary to other methods, does not impose any optimality restrictions on the teacher. Having collected the dataset, Algorithm 3 uses REGAL.C (Algorithm 1) to update $\pi_i$.
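One way to read the mixed policy is as a per-step coin flip between the teacher and the student. The Python sketch below takes that reading as an assumption; the function names are hypothetical, and the decay base of 0.5 mirrors the value reported in the experimental setup.

```python
import random

def beta_schedule(i, base=0.5):
    """Exponentially decaying mixing weight beta_i = base ** i."""
    return base ** i

def mixed_action(state, teacher_policy, student_policy, beta, rng):
    """Follow the teacher with probability beta, otherwise the student's own policy."""
    if rng.random() < beta:
        return teacher_policy(state)
    return student_policy(state)

# Usage: early iterations lean on the teacher, later ones on the learned policy.
rng = random.Random(0)
weights = [beta_schedule(i) for i in range(1, 4)]
```

With this schedule the teacher is consulted on about half of the steps in the first iteration and on fewer than one step in a thousand by the tenth.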


4.3 Theoretical Guarantees

In this section we formally derive the regret exhibited by our algorithm. At a high level, we provide two theoretical results. In the first, we consider the general teacher case, while in the second we derive a corollary of the regret for optimal teachers. We show, for the first time, better than constant improvements compared to standard learners.

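Theorem 2. Assume Algorithm 3 is running for total $T$ steps in a weakly communicating MDP $M$ starting from an initial state $s \in \mathcal{S}$. Let $H$ be a parameter such that $H \ge sp(h^*(M))$. Then, with a probability of at least $1 - \delta$, the total regret is given by:

$$\Delta(s, T) = O\left((1 - \beta + \rho\beta)\, H |\mathcal{S}| \sqrt{|\mathcal{A}| T \log(|\mathcal{A}| T / \delta)}\right),$$

where $\beta \in [0, 1]$ is such that $1 - \beta$ equals [expression not recoverable from the source], and $\rho \ge 0$ is the ratio between the teacher's regret $\Delta^T$ and the regret exhibited by REGAL.C, $\Delta^{REGAL.C}$, such that $\Delta^T \le \rho \Delta^{REGAL.C}$.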

Proof. Due to space limits, we provide a proof sketch. The proof is based on the regret bound of REGAL.C. We introduce the regret ratio to reduce the grand-teacher's regret to REGAL.C's regret. Then, we apply Hoeffding's inequality to arrive at the statement of the theorem. □

Theorem 2 implies that the teacher improves learning as long as it is "good." Namely, if $0 \le \rho < 1$, then $1 - \beta + \rho\beta < 1$, $\beta \in [0, 1]$, which implies the student can enjoy a fraction of REGAL.C's regret. However, if $\rho > 1$, then $1 - \beta + \rho\beta > 1$ and the student suffers more regret than the original REGAL.C algorithm. This justifies our intuition that good teachers assist learning while poor ones hamper learning. Moreover, if there exists prior knowledge that a teacher has poor performance, the student would be better off neglecting its advice, as following it incurs extra regret.
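As a worked illustration of the factor $1 - \beta + \rho\beta$ in Theorem 2, take $\beta = 0.5$ with one good and one poor teacher. The numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```latex
\begin{align*}
\rho = 0.5 &: \quad 1 - \beta + \rho\beta = 1 - 0.5 + (0.5)(0.5) = 0.75, \\
\rho = 2   &: \quad 1 - \beta + \rho\beta = 1 - 0.5 + (2)(0.5)   = 1.5.
\end{align*}
```

Under these assumed values the good teacher scales the REGAL.C bound by 0.75, while the poor teacher inflates it by a factor of 1.5.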

If the teacher's $\rho = 0$, we have the following Corollary:

Corollary 1. If the teacher is optimal, then with at least a probability of $1 - \delta$ the total regret is given by:

$$\Delta(s, T) = O\left((1 - \beta)\, H |\mathcal{S}| \sqrt{|\mathcal{A}| T \log(|\mathcal{A}| T / \delta)}\right).$$

Remark 1. Please note that the above theoretical results are more than a constant improvement to the regret. Notice that $\beta$ depends on the number of iterations, which can be bounded by $|\mathcal{S}|$ and $|\mathcal{A}|$ of the input MDP $M$ [Auer et al., 2009]. Further, $\rho$ depends on the teacher's policy, which is also an input to Algorithm 3. Consequently, it can be shown that these regret improvements exceed simple constant bounds.


5  Negative  Transfer 

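To formalize the relation to negative transfer, we recognize that the regret ratio can be written as:

$$\rho = \frac{\Delta^T(s, T)}{\Delta^B(s, T)} = \frac{\lambda^* T - \mathcal{R}^{\pi^T}(s, T)}{\lambda^* T - \mathcal{R}^{\pi^B}(s, T)}.$$

This suggests that we can estimate the ratio by calculating $\lambda^*$ and $\mathcal{R}^{\pi}(s, T)$, given a policy $\pi$. So, we use

$$\rho(\pi_1, \pi_2, T) = \frac{\lambda^* T - \mathcal{R}^{\pi_1}(s, T)}{\lambda^* T - \mathcal{R}^{\pi_2}(s, T)}$$

to denote the regret ratio between policy $\pi_1$ and $\pi_2$ until step $T$. At this stage, we define:

• Negative transfer from policy $\pi_1$ to $\pi_2$ until $T$ steps: $\rho(\pi_1, \pi_2, T) > 1$.

• Positive transfer from policy $\pi_1$ to $\pi_2$ until $T$ steps: $\rho(\pi_1, \pi_2, T) < 1$.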

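To formalize negative transfer, our goal at this stage is to relate $\rho(\cdot)$ to a metric between source and target tasks. For that sake, we define: $d_T(\pi_s) = \hat{\mathcal{R}}_s^{\pi_s}(s, T) - \hat{\mathcal{R}}_t^{\pi_s}(s, T)$, with $\hat{\mathcal{R}}_s^{\pi_s}(s, T)$ and $\hat{\mathcal{R}}_t^{\pi_s}(s, T)$ being the agent's estimates of the rewards in the source and the target after $T$ steps. Consequently, an estimate $\hat{\rho}$ to $\rho$ can be derived as:

$$\begin{aligned}
\hat{\rho}(\pi_s, \pi_t, T) &= \frac{\lambda^* T - \hat{\mathcal{R}}_t^{\pi_s}(s, T)}{\lambda^* T - \hat{\mathcal{R}}_t^{\pi_t}(s, T)} \\
&= \frac{\lambda^* T + \left(\hat{\mathcal{R}}_s^{\pi_s}(s, T) - \hat{\mathcal{R}}_t^{\pi_s}(s, T)\right) - \hat{\mathcal{R}}_s^{\pi_s}(s, T)}{\lambda^* T - \hat{\mathcal{R}}_t^{\pi_t}(s, T)} \\
&= \frac{\lambda^* T + d_T(\pi_s) - \hat{\mathcal{R}}_s^{\pi_s}(s, T)}{\lambda^* T - \hat{\mathcal{R}}_t^{\pi_t}(s, T)}.
\end{aligned}$$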

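$\hat{\mathcal{R}}_s^{\pi_s}(s, T)$ and $\hat{\mathcal{R}}_t^{\pi_t}(s, T)$ can be bounded by the Empirical Bernstein bound [Audibert et al., 2007]. With a probability $1 - \delta$ we have

[The bound and its accompanying definitions are not recoverable from the source; the surviving text states that $\sigma$ is the standard deviation of the sample.]

From this bound, we derive

$$\frac{\lambda^* T + d_T(\pi_s) - C_2}{\lambda^* T - C_3} \le \rho(\pi_s, \pi_t, T) \le \frac{\lambda^* T + d_T(\pi_s) - C_1}{\lambda^* T - C_4} \qquad (5)$$

where $C_1$, $C_2$, $C_3$, and $C_4$ are constants. Consequently, for negative transfer:

$$\rho(\pi_s, \pi_t, T) \ge \frac{\lambda^* T + d_T(\pi_s) - C_2}{\lambda^* T - C_3} > 1.$$

Then, assuming enough samples, negative transfer occurs if:

[Equation 6 is not recoverable from the source; only the term $\mathcal{R}_s(s_t, a_t)$ survives.] (6)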


The condition sheds light on negative transfer in terms of a metric and provides a formal way to detect it. First, if the condition in Eq. 6 holds after evaluation, researchers should avoid applying the source policy $\pi_s$ to the target tasks since it may cause negative transfer. Second, if the researchers have enough expert knowledge about their working domain and transfer information, they can usually avoid this evaluation phase in practice. In short, Eq. 6 provides a formal way to understand negative transfer and justifies the intuition (adopt high quality source knowledge and avoid bad teachers) in transfer practice.


6  Experimental  Results 

Given the above theoretical results, this section provides empirical validation on three domains:

Combination Lock: We use the domain described in Figure 2, which is a variation of the one in [Whitehead, 1991]. The experimental setting follows the caption description.

Grid World is an RL benchmark in which an agent has to navigate an $m \times m$ grid world with the goal of reaching a goal state. We employ an $11 \times 11$ grid world with a four room layout as introduced in Sutton and Barto [1998]. The agent begins in the lower left corner of the map and navigates to the goal state in the upper right corner. To navigate, the agent has access (in each cell) to four actions transitioning it to the north, south, west and east. Applying an action, it then transitions in that direction with a probability of 0.8 and in the other three with a probability of 0.2. In case the direction is blocked, the agent stays in the same state. Finally, the agent receives a reward of 0 once reaching the goal state and a reward of -1 in all others.
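The grid world dynamics described above can be sketched in Python as follows. Several details are assumptions of this sketch: the 0.2 slip probability is split uniformly over the three other directions, coordinates are `(x, y)` with the origin in the lower left corner, and the four-room walls are passed in as a hypothetical set of blocked cells rather than reproduced from Sutton and Barto.

```python
import random

MOVES = {"north": (0, 1), "south": (0, -1), "west": (-1, 0), "east": (1, 0)}

def sample_direction(intended, rng, p_intended=0.8):
    """Return the intended direction w.p. 0.8, else one of the other three uniformly."""
    if rng.random() < p_intended:
        return intended
    others = [d for d in MOVES if d != intended]
    return rng.choice(others)

def move(pos, direction, size=11, walls=frozenset()):
    """Apply a direction; the agent stays put when the target cell is blocked."""
    dx, dy = MOVES[direction]
    nxt = (pos[0] + dx, pos[1] + dy)
    off_grid = not (0 <= nxt[0] < size and 0 <= nxt[1] < size)
    return pos if off_grid or nxt in walls else nxt

def reward(pos, goal=(10, 10)):
    """0 in the goal state (upper right corner), -1 everywhere else."""
    return 0 if pos == goal else -1

def step(pos, action, rng, walls=frozenset()):
    """One noisy transition: sample the realised direction, move, and score."""
    nxt = move(pos, sample_direction(action, rng), walls=walls)
    return nxt, reward(nxt)
```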

Block Dude is a game where an agent again navigates a maze to reach a goal state. Reaching the goal directly is impossible due to the presence of blocks restricting its movement. The agent, however, can move to the left, right, and upwards. To reach the goal state, it needs to pick up blocks and relocate them in correct positions. We use the default level 1 of BURLAP [MacGlashan, 2014], in which there are two blocks and a $3 \times 25$ maze. The agent receives a reward of +1 in the goal state and a reward of -1 in all other states.


6.1  Experimental  Setup  &  Results 

To construct the grand teacher, we set the total number of teachers $n = 10$. For each teacher, the budget, $b_i$, is set to the total number of states. In Algorithm 3, the maximum number of iterations and the size of each dataset, $D_i$, were set to 10 and 200, respectively. Values of $p^i = 0.5^i$ for $i = 1, \dots, 10$ were used to determine $\beta_i = p^i$. For Algorithm 1, the confidence $\delta$ was set to 0.8 and $H$ to 1000. The optimal gain $\lambda^*$ and the optimal bias vector $h^*$ can be approximated using the value function $V$ [Puterman, 2005]. Let $V^l$ be the value function at iteration $l$, $l = 0, 1, \dots$. The optimal gain is approximated as $\lambda^* \approx sp(V^{l+1} - V^l)$, where $sp(V) = \max_{s \in \mathcal{S}} V(s) - \min_{s \in \mathcal{S}} V(s)$, and the optimal bias vector as $h^* \approx V^l - l\lambda^*$, when $l$ is large enough. To smooth the natural variance in the student's performance, each learning curve is averaged over 10 independent trials of student learning. To better evaluate our method, we adopt six experimental settings by considering different teachers and learning algorithms. For teachers, we consider three forms. The first, referred to as "optimal teacher", provides optimal actions and is used by the grand teachers. The second, referred to as "worst teacher", advises the student to take actions with the lowest Q-values, while the last randomly selects action suggestions from the set of allowed moves. We also compare our method to REGAL.C (no advice), the optimal policy (without learning), and Azar's method [Azar et al., 2013]. Please note that Azar's method cannot converge to the optimal policy and suffers loss as its performance is restricted by the teacher.
Performance, measured by the average reward, is reported in Figure 1. First, it is clear that given optimal teachers, our method exactly traces the optimal policy, achieving a regret of 0. It is also important to note that in all three domains, even if the teacher was not optimal, and contrary to current techniques, our method is capable of acquiring optimal behavior. This is achievable as our method allows for learning within the multiple teacher framework.

7  Related  Work  on  Transfer  Learning 

Few theoretical results on transfer and policy advice have been achieved. Closest to this work is that of Taylor et al. [2014], where the authors only provide empirical validations of their approach without drawing on any theoretical analysis. Given the theoretical derivations in this paper, we in fact note that the method of [Taylor et al., 2014] is a special case of ours considering only one-teacher advice models.

Another method considering advice under multiple teachers is that of Azar et al. [2013]. Azar et al. propose a method capable of selecting the best policy from a set of teacher policies and derive sub-linear regret of the form $O(\sqrt{T})$, with $T$ being the total number of rounds. One drawback of their method, however, is the assumption of a "good-enough" teacher which can guide the student to optimality. Such a method may suffer huge regret if the overall quality of teacher policies is poor. It also cannot obtain better policies than those of the teacher. Our algorithm remedies these problems by allowing agents to further improve, which gives them the opportunity to surpass the teacher's performance.

(a) Grid World (b) Combination Lock (c) Block Dude

Figure 1: Our method with optimal teacher has similar performance to the optimal policy. The REGAL algorithm (no advice) outperforms the random teacher and worst teacher groups, which confirms that poorer teachers do harm learning. Azar's method depends on the quality of the teachers: when the teachers are very poor, the algorithm shows no learning.

Figure 2: There are $n + 1$ states in the MDP. The last state has only one action and the rest have two. The agent receives reward -1 for all actions, except when taking action $a$ in state $n$, $\mathcal{R}(n, a) = 1$. The agent stays in state $n$ with probability 0.9 and goes to state 0 with probability 0.1. The optimal policy is to take action $a$ in each state. Since there are $n + 1$ states, the budget $b$ is at least $n + 1$ to achieve zero regret.

Human advice is also a good source of policy advice. Usually, such methods adopt the human advice as the teacher's policy to improve the learning performance. However, these works focus on empirical validations [Cakmak and Lopes, 2012; Griffith et al., 2013].

Probabilistic policy reuse is similar to our method in that the algorithm follows its own knowledge with probability $1 - \epsilon$ and the teacher's policy with probability $\epsilon$ [Fernandez and Veloso, 2006]. However, $\epsilon$ is not decaying over time, making the algorithm divergent if teacher policies are not optimal. Cederborg et al. introduce a policy shaping algorithm using human teachers, but focus on providing rewards rather than action advice [Cederborg et al., 2015]. Both of these works rely solely on empirical results.

Work on transfer for RL is also related to this paper, where we can consider policy advice as an instance of transferring from teachers to students [Lazaric, 2012]. Here, Ferrante et al., for instance, propose a method to transfer high quality samples from source to target tasks using bi-simulation measures [Ferrante et al., 2008]. Their method only transfers samples once, while our approach gradually provides advice to the student. Due to space constraints, we refer the reader to Taylor and Stone [2009] for a comprehensive survey.

Lifelong reinforcement learning has drawn significant attention from the transfer community recently. Brunskill and Li studied online discovery problems in a lifelong learning setting [Brunskill and Li, 2015]. Bou-Ammar et al. also studied such a problem and introduced constraints on the policy to compute "safe" policies [Bou-Ammar et al., 2015]. Contrary to these works, in this paper, we focus on the single agent setting operating within one task.

Finally, Learning from Demonstration [Argall et al., 2009] (LfD) is also related to our work, but LfD usually assumes that the expert is optimal and the student only tries to mimic the expert.

8  Conclusion  and  Future  Work 

In this paper, we formally defined the multi-teacher advice model and introduced a new algorithm which leverages both the teachers' and the student's own knowledge in weakly communicating MDPs. We theoretically analyzed our algorithm and showed, for the first time, that the agent can achieve optimality even when starting from non-optimal teachers. Our results provide a theoretical justification for the intuition that "bad" teachers can hurt the learning process of the student. Also, we formally established the condition for negative transfer, shedding light on future transfer learning research, where, for example, researchers can choose "good teachers" based on Eq. 6 and avoid negative transfer with prior expert knowledge.

In the future, we plan on adopting other online reinforcement learning algorithms (e.g., REGAL.D [Bartlett and Tewari, 2009], R-max [Brafman and Tennenholtz, 2003], or E³ [Kearns and Singh, 2002]) to replace REGAL.C. We will provide better methods to construct the "grand-teacher" without exploring the whole MDP. Also, extensions to large-scale MDPs may be an interesting direction for future research as well.

ABSTRACT

Reinforcement learning is a powerful machine learning paradigm that allows agents to autonomously learn to maximize a scalar reward. However, it often suffers from poor initial performance and long learning times. This paper discusses how collecting on-line human feedback, both in real time and post hoc, can potentially improve the performance of such learning systems. We use the game Pac-Man to simulate a navigation setting and show that workers are able to accurately identify both when a sub-optimal action is executed, and what action should have been performed instead. Demonstrating that the crowd is capable of generating this input, and discussing the types of errors that occur, serves as a critical first step in designing systems that use this real-time feedback to improve systems' learning performance on-the-fly.

INTRODUCTION 

Reinforcement learning [7] is a very flexible, robust approach to solving problems. However, early in the training process much of the problem space is unexplored, often resulting in poor performance because reasonable policies are only discovered after a considerable amount of trial-and-error. In this paper, we propose the idea of using on-demand human intelligence, available via crowdsourcing platforms such as Amazon Mechanical Turk, to provide immediate feedback to reinforcement learning systems based on the intuition and experience of the human observer.

To test whether crowd workers are able to accurately provide such advice, we perform a set of experiments that measure the crowd's ability to generate just-in-time warnings to an agent playing Pac-Man. First, we establish that the crowd can collectively identify the correct point at which an error occurs with over 91% accuracy. Second, we demonstrate that not only can this mistake identification be done in real time with a mean latency of just 0.39 seconds, but also that workers are able to identify what the optimal move would have been.

Figure 1: This screenshot shows the web interface of the user study with game layout, and components of the Pac-Man game: 1) Pac-Man, 2) 4 Ghosts, 3) Pills, and 4) Power Pills.

Third, we compare the crowd's performance in this real-time setting with an offline "review" setting where game playback can be controlled and replayed. In this setting, mistakes can be better estimated, with a mean distance from the correct position of just 0.15 seconds.

This work is the first research to establish the crowd's ability to react to mistakes made by an intelligent agent in real time, and provide accurate guidance on a preferred alternative action. Our work informs the design of future systems that use human intelligence to guide untrained systems through the learning process, without limiting systems to only learn from their mistakes long after they make them.

The contributions of this paper are to:

• Present the idea that on-line crowds can provide very accurate assistance to learning agents by using real-time data.

• Demonstrate that crowd workers can respond quickly and accurately enough to provide just-in-time feedback.

• Show that workers can also improve their accuracy in post hoc review settings for use in future situations.

BACKGROUND AND RELATED WORK

Reinforcement learning has a history of succeeding on difficult problems with little information. This paper leverages the on-policy learning algorithm Sarsa [7]. A Sarsa agent learns to estimate Q-values, representing the estimated total reward the agent would receive if it were in a state $s$, executed action $a$, and then followed the current policy until the episode is terminated. Over time, this type of temporal difference learning allows the agent to learn a (near) optimal policy that collects as much reward as possible, in expectation.
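The temporal difference update behind Sarsa can be sketched in a few lines of Python. This is the standard tabular form of the rule, shown as a simplification: the Pac-Man agent in this paper uses function approximation over high-level features instead of a table, and the step size `alpha` and discount `gamma` below are hypothetical values.

```python
from collections import defaultdict

def sarsa_update(q, state, action, reward, next_state, next_action,
                 alpha=0.1, gamma=0.9, terminal=False):
    """Move Q(s, a) toward r + gamma * Q(s', a'), where a' is the action actually taken next."""
    target = reward if terminal else reward + gamma * q[(next_state, next_action)]
    q[(state, action)] += alpha * (target - q[(state, action)])
    return q[(state, action)]

# Unseen state-action pairs start at 0.0.
q = defaultdict(float)
```

Because the target uses the next action chosen by the current policy rather than the greedy one, the estimates track the policy being followed, which is what makes Sarsa on-policy.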

While autonomous methods like Sarsa have many empirical successes and sound theoretical underpinnings, if a human is available to provide useful information, the agent is often able to learn much faster. For example, a user could make judgements about the agent's performance by providing human reward [3, 6] or providing demonstrations. Finally, when a human can temporarily control the agent to perform the correct behavior, learning from demonstration techniques [1] can be applied.

Most related to this paper is existing work done on crowdsourcing control and recognition tasks. Human control of robots has been previously explored in the context of a robot Ouija board and navigation setting [2]. The Robot Management System [10] (RMS) also uses on-line contributors to crowdsource human-robot interaction studies. RMS used groups of participants working from their home computers to practice controlling a robot using camera views and a web-based control interface.

Legion [4] explored using crowd workers to collaboratively control a robot in real time. This was the first work to show that on-demand human intelligence could be used to control a robot when an automatic system is unable to proceed. Legion:AR [5] showed that an active learning approach could be used in an activity recognition setting to call on crowd support for an action-labeling task only when needed. In both systems, low-latency responses were achieved by keeping the crowd continuously engaged with a task for a period of time. However, complete crowd control does not let the system effectively evaluate its own policy. In this work, we explore if and how we can use real-time crowds in an advisory role, without needing the crowd to directly control the Pac-Man avatar, while still maintaining exceptional response speeds.

EXPERIMENTAL  DESIGN 

Our Pac-Man agent (see Figure 1) used Sarsa to learn a
near-optimal policy to win the game while earning as many
points as possible, using an existing open learning implementation
[8]. Due to the large state space, the agent uses seven
high-level features for function approximation to learn a continuous
Q-value function. Pac-Man code is available from
http://www.eecs.wsu.edu/~taylorm/13PacMan.zip.
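
As a rough illustration of the learner described above, the following sketch shows a single Sarsa update with linear function approximation over seven features. The per-action weight vectors, the step size alpha, the discount factor gamma, and the feature values are all illustrative assumptions; none of them are taken from the Pac-Man implementation [8].

```python
ACTIONS = ("up", "right", "down", "left")
N_FEATURES = 7  # the agent in the text uses seven high-level features


def q_value(weights, features, action):
    """Linear estimate of Q(s, a): dot product of the action's weights and the features."""
    return sum(w * f for w, f in zip(weights[action], features))


def sarsa_update(weights, features, action, reward,
                 next_features, next_action,
                 alpha=0.01, gamma=0.99, terminal=False):
    """Apply one on-policy Sarsa update in place and return the TD error.

    alpha (step size) and gamma (discount factor) are hypothetical values.
    """
    target = reward
    if not terminal:
        # Bootstrap from the action actually chosen in the next state.
        target += gamma * q_value(weights, next_features, next_action)
    td_error = target - q_value(weights, features, action)
    weights[action] = [w + alpha * td_error * f
                       for w, f in zip(weights[action], features)]
    return td_error


# Made-up example: all weights start at zero and the features are invented.
weights = {a: [0.0] * N_FEATURES for a in ACTIONS}
phi = [1.0] * N_FEATURES
phi_next = [0.5] * N_FEATURES
delta = sarsa_update(weights, phi, "down", reward=10.0,
                     next_features=phi_next, next_action="left")
```

Keeping one weight vector per action is only one of several common ways to parameterize a linear Q-function; it is used here because it keeps the update easy to read.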

To generate the videos used in the user study, we recorded
Pac-Man being controlled by a human who intentionally
made different types of mistakes. Then, we picked clips of 10-14
seconds that each contained one (and only one) mistake.
Q-values for the agent’s trajectory were also recorded, confirming
that the human-created mistakes had lower Q-values than
the “correct” action. We created four videos, each containing
a mistake: Video 1) moving so that Pac-Man is trapped
by one or more ghosts, Video 2) not moving towards an edible
ghost after eating a power pill, Video 3) taking an empty path
instead of going for pills when there is no risk, and Video 4)
not going for all edible ghosts that are within close range.

To study the hypothesis that crowd workers can provide information
useful to reinforcement learning agents, we consider
four settings, formed by two choices. First, the video of Pac-Man
is either played only once (real-time) or can be viewed by the
worker multiple times (review). Second, the worker is asked either
to identify the time at which the mistake is made (Mistake
Identification) or to identify the mistake time and also suggest
the optimal action (Action Suggestion).

We want to measure the performance of users in identifying
the point at which a mistake is made and suggesting the optimal
action Pac-Man should have executed. To evaluate the actions
workers suggest, we compare them to the recorded Q-values.

USER  STUDIES 

Workers were first shown instructions describing the task, as
well as the rules of Pac-Man. During a preliminary test of
the web interface, we found that workers would sometimes
identify mistakes before the sub-optimal action was executed,
anticipating the mistake. We therefore provided explicit instructions
to workers to encourage them to identify the exact time at
which a mistake was made. Workers were then directed to a
tutorial which asked them to complete an example task using
the marking interface. After the tutorial, workers watched
a new video and had to press a button (see Figure 1) as soon as
they observed a mistake.

We recruited crowd workers from Amazon Mechanical Turk
(AMT) for our experiments. While AMT provides immediate,
programmatic access to crowds, it also poses a number
of challenges, including that workers: 1) are unlikely
to be experts, 2) may not take the task seriously and not
read the instructions, and 3) may intentionally select incorrect
times/actions. Our methods need to be robust to these
challenges, unlike in Learning from Demonstration, where
demonstrations are typically assumed to be optimal.

Sixteen Human Intelligence Tasks (HITs) on AMT encompassed
our four different types of experiments, each of which was
tested with 4 different videos. We collected data from 30
unique workers per HIT, and every worker was paid 25 cents.
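
To make the study layout concrete, the sketch below enumerates the 16 HITs as the product of the two viewing settings, the two task types, and the four videos. The total-cost figure is an inference rather than a number reported here: it assumes each worker was paid once per HIT and ignores any platform fees.

```python
from itertools import product

SETTINGS = ("real-time", "review")
TASKS = ("Mistake Identification", "Action Suggestion")
VIDEOS = (1, 2, 3, 4)
WORKERS_PER_HIT = 30
PAYMENT_CENTS = 25  # assumed to be paid once per worker per HIT

# One HIT per (setting, task, video) combination.
hits = list(product(SETTINGS, TASKS, VIDEOS))

total_responses = len(hits) * WORKERS_PER_HIT
# Inferred total payout in dollars, excluding any platform fees.
total_cost_dollars = total_responses * PAYMENT_CENTS / 100
```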

RESULT  ANALYSIS 

This section presents the results of our study in three parts.
First, we establish that the crowd can identify the mistake
with high accuracy. Second, we demonstrate that not only
can Mistake Identification be done in real-time but that workers
can also successfully identify what the optimal “correct”
move would have been. Third, we compare the crowd's performance
in the real-time setting with the offline “review” setting
and show that if additional time is available, even more accurate
performance can be achieved.

(a) A histogram of workers' suggestions using Video 1 (mistake time: 986).
(b) A histogram of workers' suggestions using Video 3 (mistake time: 1116).
(c) Number of times each action was suggested by workers (Optimal: 1-down, 2-down, 3-down, 4-up).

Figure 2: Selected exemplar results from our 16 Amazon Mechanical Turk experiments.


Mistake  Identification 

Our performance measure is based on how many workers can
correctly identify and suggest a time that is close to the correct
mistake time. Histograms provide a visual representation
of the accuracy of workers in different settings. The mistake
times are reported as game move numbers, which are
986, 1809, 1116 and 334, for Videos 1-4, respectively. These
video clips are 10 to 14 seconds long, corresponding to 250-350
total game moves, and the mistakes are located roughly three
quarters of the way through the clip. However, because Pac-Man
moves continually, it is difficult for workers to identify
the exact frame when the mistake was executed.

To quantify how accurate the workers were, we calculated the
difference between the actual mistake time and the identified
mistake time, where zero corresponds to a perfect answer. We
selected a threshold of 30 actions, roughly 1 second, so that
any answer within ±1 second is counted as correctly
identifying the mistake. Figure 2(a) shows the distribution
of workers' answers, where responses within the 956-1016
moves range are considered to be correct, showing only two
errant responses.
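
The threshold test can be sketched as below. The responses in the example are made up for illustration, and treating the boundaries (956 and 1016 for Video 1) as correct is an assumption based on the range quoted above.

```python
THRESHOLD_MOVES = 30  # roughly 1 second of game play


def is_correct(suggested_move, mistake_move, threshold=THRESHOLD_MOVES):
    """True if the suggested move lies within +/- threshold of the mistake."""
    return abs(suggested_move - mistake_move) <= threshold


def fraction_correct(suggestions, mistake_move, threshold=THRESHOLD_MOVES):
    """Share of responses counted as correctly identifying the mistake."""
    if not suggestions:
        return 0.0
    n_correct = sum(is_correct(s, mistake_move, threshold) for s in suggestions)
    return n_correct / len(suggestions)


# Hypothetical responses for Video 1, whose mistake occurs at move 986.
responses = [980, 990, 1016, 956, 1020, 900]
accuracy = fraction_correct(responses, mistake_move=986)
```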

To compute the mean difference between the time reported by
a worker and the real error time $t_m$, we use:

$\mu_m(AMT_k) = \frac{1}{n} \sum_{i=1}^{n} |t_i - t_m|$,

where $k$ is the group number, $n$ is the total
number of workers per group that are within the threshold,
$t_i$ is the $i$th worker’s suggested time, and $t_m$ is the correct
mistake time. The standard deviation $\sigma_m(AMT_k)$ of these
differences is also computed, where a low
value indicates suggestions are tightly clustered.
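
A small sketch of these two statistics follows. It assumes that the difference is taken as the absolute offset in game moves and that the population (divide-by-n) form of the standard deviation is intended; both choices, the function names, and the sample data are assumptions made for illustration.

```python
from statistics import mean, pstdev


def differences_within_threshold(suggestions, mistake_move, threshold=30):
    """Absolute offsets (in moves) of the responses that fall within the threshold."""
    return [abs(t - mistake_move) for t in suggestions
            if abs(t - mistake_move) <= threshold]


def mean_difference(suggestions, mistake_move, threshold=30):
    """Mean offset over the n workers whose answers are within the threshold."""
    diffs = differences_within_threshold(suggestions, mistake_move, threshold)
    if not diffs:
        raise ValueError("no responses within the threshold")
    return mean(diffs)


def std_difference(suggestions, mistake_move, threshold=30):
    """Population standard deviation of the same offsets (low = tightly clustered)."""
    diffs = differences_within_threshold(suggestions, mistake_move, threshold)
    if not diffs:
        raise ValueError("no responses within the threshold")
    return pstdev(diffs)


# Hypothetical group of responses for a mistake at move 986;
# the response at move 900 is outside the threshold and is ignored.
group = [980, 990, 1000, 900]
mu = mean_difference(group, 986)
sigma = std_difference(group, 986)
```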

To establish that workers can correctly identify where and
when mistakes occur in our game, we count the number of
people who correctly identified the mistake. Video 1's review
setting has the highest collective percentage of correct events
of all four videos, at 98.3%. This is followed by Videos 4
and 2, with 88.0% and 86.6% respectively. Video 3 has
the lowest accuracy, at 68.4%. The high percentage
of correct events observed for the other three videos suggests that the
crowd can identify a mistake in many cases.


It is also important to point out that there are instances in
which the mistakes are more subtle, making them harder to identify.
Video 3 has the lowest accuracy, and the Mistake Identification
experiment has the lowest percentage of correct answers,
at 56.7%. The sparsity of the data shown
in Figure 2(b) also suggests that this mistake was harder to find.

In summary, these results establish that, in most cases,
workers can identify a mistake in the Pac-Man game, with an
overall accuracy of 80% for review Mistake Identification and
91% for review Action Suggestion.

Optimal  Action  Identification 

Given that workers can correctly identify mistakes, we next
consider whether they can also accurately provide the action
that should have been taken. To do this, we first have to verify
that a majority of the workers within the threshold suggested
the same action, and second, that the suggested action has the
maximum Q-value in the recorded video’s game state.

Figure 2(c) shows all workers' suggested actions that are
within the 30-move threshold, in both the real-time and review
cases, and indicates that a majority of workers do suggest
similar actions.

Knowing that the crowd reaches consensus on a single action,
we can now compare the crowd's advice to the recorded Q-values
of the game to verify if it is the correct (near-optimal)
action. The maximum Q-value of the 4 possible Pac-Man
actions determines what action Pac-Man should perform. In
Video 1, the Q-values one step before move 986 indicate what
the next move should be. At move 985, the Q-values are: up =
1729, right = 1621, down = 1768, and left = 1621. In
the human-controlled game in Video 1, Pac-Man went right
at this time when it should have gone down (the maximum
Q-value). As shown in Figure 2(c), workers did suggest
that Pac-Man should move downward in Video 1. Similar results in
the other three videos demonstrate that workers can not only identify
that a mistake has been made but also provide advice
that is useful and near-optimal.
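
The two-step check described in this subsection (majority agreement, then comparison against the maximum Q-value) can be sketched as follows. The Q-values are those reported for move 985 of Video 1; the list of worker suggestions is invented for illustration, and the function names are not from the original implementation.

```python
from collections import Counter

# Q-values recorded at move 985 of Video 1, as reported in the text.
Q_VALUES_MOVE_985 = {"up": 1729, "right": 1621, "down": 1768, "left": 1621}


def best_action(q_values):
    """Action with the maximum recorded Q-value."""
    return max(q_values, key=q_values.get)


def majority_action(suggested_actions):
    """Most frequently suggested action and the share of workers who chose it."""
    counts = Counter(suggested_actions)
    action, count = counts.most_common(1)[0]
    return action, count / len(suggested_actions)


# Hypothetical suggestions from workers whose answers were within the threshold.
suggestions = ["down", "down", "up", "down", "left", "down"]
crowd_action, share = majority_action(suggestions)

# The crowd's advice is accepted only if a strict majority agrees
# and the agreed action matches the maximum-Q action.
advice_is_optimal = share > 0.5 and crowd_action == best_action(Q_VALUES_MOVE_985)
```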

Real-time vs. Review

We expected the real-time setting to be considerably harder
than the review setting. This assumption can be verified by considering
the mean difference for each setting: the average
mean difference for the real-time setting is 9.1 moves (≈ 0.36
seconds), while for the review case it is 4.5 moves (≈ 0.18 seconds).
The lower mean difference in review experiments shows that
if additional time is available, even closer estimates of the
point of the mistake can be gathered.
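
The conversion from moves to seconds used here can be sketched as below. The rate of 25 moves per second is not stated explicitly; it is inferred from the clip lengths given earlier (10 to 14 seconds spanning 250-350 moves), and the helper name is illustrative.

```python
# Inferred game speed: 250 moves / 10 s = 350 moves / 14 s = 25 moves per second.
MOVES_PER_SECOND = 250 / 10


def moves_to_seconds(moves, rate=MOVES_PER_SECOND):
    """Convert a difference measured in game moves to seconds."""
    return moves / rate


real_time_seconds = moves_to_seconds(9.1)  # mean difference, real-time setting
review_seconds = moves_to_seconds(4.5)     # mean difference, review setting
```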

We performed a 4 × 2 between-subjects factorial ANOVA
on all Action Suggestion experiments, which shows that the difference
in the mistake time suggested by workers between the
real-time and review settings was statistically significant
(F = 5.10, p < .05, η² = .023). The difference between the
real-time and review settings in all Mistake Identification experiments
is also significant (F = 5.02, p < .05, η² = .022).
This indicates that the different mistakes in each video can
also affect workers' ability to identify them.

Interestingly, there is only a small difference between the
mean differences of the real-time and review settings in Mistake
Identification for Video 2. This indicates that the mistake in
Video 2 was harder for workers to identify than the mistakes
in the other three videos.

It is notable here that the average of the mean differences in
the real-time setting is 9.8 moves (≈ 0.39 seconds) for Mistake
Identification and 8.8 moves (≈ 0.35 seconds) for Action
Suggestion, which are both very close to the human
response time for tasks with no high-level reasoning needed
(e.g., clicking a button in response to a visual stimulus). This
suggests that crowd advice for tasks, such as navigation, can
be collected nearly as fast as people can physically respond.
This quickly-available input can, in turn, be used to improve
real-time learning of virtual and physical agents.

FUTURE  WORK 

Future work will focus on developing learning algorithms
that leverage the unique strengths of human input on-the-fly
without being detrimentally affected by incorrect advice.
Although others [9] have incorporated advice from multiple
demonstrators in past work, errors from crowdsourced workers
present unique challenges, along with opportunities to scale these
systems. Furthermore, we plan to continue to improve our
interfaces to further reduce the latency of worker responses.
One potential method to do this is to leverage workers' ability
to predict when mistakes might be made, which we initially
observed, to collectively decrease latency below the
best after-the-fact response speed possible. We are also interested
in studying how the number of examples during the
tutorial affects participants’ accuracy. Finally, we are interested
in eliciting a confidence measure from workers, potentially
allowing us to weight different pieces of advice.

CONCLUSION 

Reinforcement learning algorithms often suffer from poor
early-stage performance since agents have to experience a considerable
amount of trial-and-error before learning an effective
policy. Our approach uses real-time crowds to provide
immediate assistance to the learning agent to help improve its
performance. We ran a set of user studies to show that crowd
workers from Amazon Mechanical Turk can respond quickly
and accurately enough to provide just-in-time feedback to an
agent playing Pac-Man. We showed that workers can correctly
identify the point at which a mistake is made by Pac-Man and
the optimal action Pac-Man should have executed. We also
showed that higher performance could be achieved by workers
in post hoc review settings.

Our results demonstrated that 1) crowd workers are able to
accurately choose the mistake time in real-time with a mean
latency of just 0.39s, and 2) latency does not increase if workers
must also suggest an action. By leveraging the crowd, we
present an effective, scalable means of providing during-task
assistance to learning agents.

List of Symbols, Abbreviations, and Acronyms


AAAI: AAAI Conference on Artificial Intelligence
ELLA: Efficient Lifelong Learning Algorithm
HIT: Human Intelligence Task
ICML: International Conference on Machine Learning
IJCAI: International Joint Conference on Artificial Intelligence
IROS: International Conference on Intelligent Robots and Systems
IUI: Association of Computing Machinery Conference on Intelligent User Interfaces
MDP: Markov Decision Process
MTL: Multi-task Learning
PG-ELLA: Policy Gradient ELLA
Q-Learning: Q-value Learning, a type of TD-Learning
RL: reinforcement learning
SDM: sequential decision-making
TD-Learning: Temporal Difference Learning, a type of RL algorithm